Abstract

Cumulative prospect theory (CPT) is known to model human decisions well, with substantial empirical
evidence supporting this claim. CPT works by distorting probabilities and is more general than the
classic expected utility and coherent risk measures. We bring this idea to a risk-sensitive reinforcement 
learning (RL) setting and design algorithms for both estimation and control. The RL setting presents 
two particular challenges when CPT is applied: estimating the CPT objective requires estimations of the 
entire distribution of the value function and finding a randomized optimal policy. The estimation scheme 
that we propose uses the empirical distribution to estimate the CPT-value of a random variable. We then 
use this scheme in the inner loop of a CPT-value optimization procedure that is based on the well-known 
simulation optimization idea of simultaneous perturbation stochastic approximation (SPSA). We provide
theoretical convergence guarantees for all the proposed algorithms and also illustrate the usefulness of 
CPT-based criteria in a traffic signal control application. 


1 Introduction 

Since the beginning of its history, mankind has been deeply immersed in designing and improving systems to
serve human needs. Policy makers are busy designing systems that serve the education, transportation,
economic, health and other needs of the public, while private sector enterprises are hard at work creating and
optimizing systems to serve the more specialized needs of their customers. While it has long been
recognized that understanding human behavior is a prerequisite to best serving human needs (e.g., Simon 1959),
it is only recently that this approach has gained wider recognition. 1

1 As evidence for this wider recognition in the public sector, we can mention a recent executive order of the White House calling
for the use of behavioral science in public policy making, or the establishment of the “Committee on Traveler Behavior and Values”
in the Transportation Research Board in the US.

In this paper we consider human-centered reinforcement learning problems where the reinforcement
learning agent controls a system to produce long term outcomes (“return”) that are maximally aligned with
the preferences of one or possibly multiple humans, an arrangement shown in Figure 1. As a running
example, consider traffic optimization where the goal is to maximize travelers’ satisfaction, a challenging
problem in big cities. In this example, the outcomes (“return”) are travel times, or delays. To capture human 
preferences, the outcomes are mapped to a single numerical quantity. While preferences of rational agents 
facing uncertain situations can be modeled using expected utilities (i.e., the expectation of a nonlinear
transformation, such as the exponential function, of the rewards or costs) (Von Neumann and Morgenstern 1944;
Fishburn 1970), it is well known that humans are subject to various emotional and cognitive biases, and, 
the psychology literature agrees that human preferences are inconsistent with expected utilities regardless 
of what nonlinearities are used (Allais 1953; Ellsberg 1961; Kahneman and Tversky 1979). An approach 
that gained strong support amongst psychologists, behavioral scientists and economists (e.g., Starmer 2000; 
Quiggin 2012) is based on Kahneman and Tversky (1979)’s celebrated prospect theory (PT). Therefore, in 
this work, we will base our models of human preferences on this theory. More precisely, we will use cumulative
prospect theory (CPT), a later, refined variant of prospect theory due to Tversky and Kahneman (1992),
which is even more empirically and theoretically supported than prospect theory (e.g., Barberis 2013). CPT 
generalizes expected utility theory in that in addition to having a utility function transforming the outcomes, 
another function is introduced which distorts the probabilities in the cumulative distribution function. As 
compared to prospect theory, CPT is monotone with respect to stochastic dominance, a property that is 
thought to be useful and (mostly) consistent with human preferences 2 . 



Figure 1: Operational flow of a human-based decision making system 


Our contributions: To our best knowledge, we are the first to investigate (and define) human-centered RL, 
and, in particular, this is the first work to combine CPT with RL. Although on the surface the combination 
may seem straightforward, in fact there are many research challenges that arise from trying to apply a CPT 
objective in the RL framework, as we will soon see. We outline these challenges as well as our solution 
approach below. 

The first challenge stems from the fact that the CPT-value assigned to a random variable is defined
through a nonlinear transformation of certain cumulative distribution functions associated with the random
variable (cf. Section 2 for the definition). Hence, even the problem of estimating the CPT-value given
a random sample requires some effort. In this paper, we consider a natural quantile-based estimator and
analyze its behavior. Under certain technical assumptions, we prove consistency and sample complexity
bounds, the latter based on the Dvoretzky-Kiefer-Wolfowitz (DKW) theorem. As an example, we show that
the sample complexity for estimating the CPT-value for Lipschitz probability distortion (so-called “weight”)
functions is \(O(1/\epsilon^2)\), which coincides with the canonical rate for Monte Carlo-type schemes. Since weight
functions that fit well to human preferences are only Hölder continuous, we also consider this case and find
that (unsurprisingly) the sample complexity jumps to \(O(1/\epsilon^{2/\alpha})\), where \(\alpha \in (0,1]\) is the weight function’s
Hölder exponent.

2 See Appendix A for an introduction to PT/CPT and a description of the Allais paradox.

The work on estimating CPT-values forms the basis of the algorithms that we propose to maximize 
CPT-values based on interacting either with a real environment, or a simulator. We set up this problem 
as an instance of policy search: We consider smoothly parameterized policies whose parameters are tuned
via stochastic gradient ascent. For estimating gradients, we use two-point randomized gradient estimators,
borrowed from simultaneous perturbation stochastic approximation (SPSA), a widely used algorithm in
simulation optimization (Fu 2015). Here a new challenge arises, which is that we can only feed the
two-point randomized gradient estimator with biased estimates of the CPT-value. To guarantee convergence, we
propose a particular way of controlling the arising bias-variance tradeoff. 

To put things in context, risk-sensitive reinforcement learning problems are generally hard to solve.
For a discounted MDP, Sobel (1982) showed that there exists a Bellman equation for the variance of the 
return, but the underlying Bellman operator is not necessarily monotone and this rules out policy iteration 
as a solution approach for variance-constrained MDPs. Further, even if the transition dynamics are known, 
Mannor and Tsitsiklis (2013) show that finding a globally mean-variance optimal policy in a discounted 
MDP is NP-hard. For average reward MDPs, Filar et al. (1989) motivate a different notion of variance and 
then provide NP-hardness results for finding a globally variance-optimal policy. CVaR as a risk measure is 
equally complicated as the measure here is a conditional expectation, where the conditioning is on a low 
probability event. Apart from the hardness of finding CVaR-optimal solutions, estimating CVaR for a fixed 
policy in a typical RL setting itself is a challenge considering CVaR relates to rare events and to the best of 
our knowledge, there is no algorithm with theoretical guarantees to estimate CVaR without wasting a lot of 
samples. There are proposals based on importance sampling (cf. Prashanth 2014; Tamar et al. 2014), but
they lack theoretical guarantees. 

We derive a provably sample-efficient scheme for estimating the CPT-value (see next section for a 
precise definition) for a given policy and use this as the inner loop in a policy optimization scheme. Finally, 
we point out that the CPT-value that we define is a generalization of the above previous works in the sense 
that one can recover the regular value function and the risk measures such as VaR and CVaR by appropriate 
choices of the distortions used in the definition of the CPT-value.

The work closest to ours is by Lin (2013), who proposes a CPT-measure for an abstract MDP setting. 
We differ from Lin (2013) in several ways: (i) We do not assume a nested structure for the CPT-value and 
this implies the lack of a Bellman equation for our CPT measure; (ii) we do not assume model information, 
i.e., we operate in a model-free RL setting. Moreover, we develop both estimation and control algorithms 
with convergence guarantees for the CPT-value function. 

The rest of the paper is organized as follows: In Section 2, we introduce the notion of CPT-value of a 
random variable X. In Section 3, we describe a quantile-based scheme for estimating the CPT-value. In 
Section 4, we present a gradient-based algorithm for optimizing the CPT-value. We present the simulation 
results for a traffic signal control application in Section 5 and finally, provide the concluding remarks in 
Section 6. Appendix A provides background material for CPT and Appendix B makes a special case of the 
CPT-value in a stochastic shortest path problem. We provide the proofs of convergence for all the proposed 
algorithms in Appendices C-D. Further, Appendix E describes a second-order algorithm for CPT-value 
optimization. 
Figure 2: An example of a utility function.


2 CPT-value 

For a real-valued random variable X, we introduce a “CPT-functional” that replaces the traditional expectation
operator. The CPT-value of the random variable X is defined as

\[C_{u,w}(X) = \int_0^{+\infty} w^+\big(P(u^+(X) > z)\big)\,dz - \int_0^{+\infty} w^-\big(P(u^-(X) > z)\big)\,dz, \qquad (1)\]

where \(u = (u^+, u^-)\), \(w = (w^+, w^-)\), \(u^+, u^- : \mathbb{R} \to \mathbb{R}_+\) and \(w^+, w^- : [0,1] \to [0,1]\) are continuous (see
assumptions (A1)-(A2) in Section 3 for precise requirements on u and w). For notational convenience, since
u, w will be fixed, we drop the dependence on u, w and use C(X) to denote the CPT-value. Fig. 2 shows
an example of the utility functions \(u = (u^+, u^-)\) and how they relate to each other, while Fig. 3 shows an
example of a typical weight function.

In the definition, \(u^+, u^-\) are utility functions corresponding to gains (X > 0) and losses (X < 0),
respectively. For example, consider a scenario where one can either earn $500 w.p. 1 or earn $1000 w.p. 0.5
(and nothing otherwise). The human tendency is to choose the former option of a certain gain. If we flip
the situation, i.e., a certain loss of $500 or a loss of $1000 w.p. 0.5, then humans choose the latter option.
Handling losses and gains separately is a salient feature of CPT, and this addresses the tendency of humans
to play safe with gains and take risks with losses (see Fig. 2). In contrast, the traditional value function makes
no such distinction between gains and losses.

The functions \(w^+, w^-\), called the weight functions, capture the idea that humans deflate high probabilities
and inflate low probabilities. For example, humans usually choose a stock that gives a large reward, e.g.,
one million dollars w.p. \(1/10^6\), over one that gives $1 w.p. 1, and the reverse when signs are flipped. Thus
the value seen by the human subject is non-linear in the underlying probabilities, an observation backed
by strong empirical evidence (Tversky and Kahneman 1992; Barberis 2013). In contrast, the traditional
value function is linear in the underlying probabilities. As illustrated with \(w = w^+ = w^-\) in Fig. 3, the
weight functions are continuous, non-decreasing and have the range [0,1] with \(w^+(0) = w^-(0) = 0\) and
\(w^+(1) = w^-(1) = 1\). Tversky and Kahneman (1992) recommend \(w(p) = \frac{p^{\eta}}{(p^{\eta} + (1-p)^{\eta})^{1/\eta}}\), while Prelec
(1998) recommends \(w(p) = \exp(-(-\ln p)^{\eta})\), with \(0 < \eta < 1\). In both cases, the weight function has the
inverted-s shape.
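
To make the inverted-s shape concrete, the following minimal Python sketch evaluates the two weight functions quoted above. The function names `tk_weight` and `prelec_weight`, the default value 0.61 for the exponent, and the explicit handling of the endpoints 0 and 1 are illustrative assumptions, not something prescribed by the text.

```python
import math

def tk_weight(p, eta=0.61):
    """Tversky-Kahneman weight: p^eta / (p^eta + (1-p)^eta)^(1/eta)."""
    p = min(max(p, 0.0), 1.0)  # clamp to [0, 1] to guard against rounding
    num = p ** eta
    return num / (num + (1.0 - p) ** eta) ** (1.0 / eta)

def prelec_weight(p, eta=0.61):
    """Prelec weight: exp(-(-ln p)^eta), extended by w(0) = 0 and w(1) = 1."""
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    return math.exp(-((-math.log(p)) ** eta))

# Low probabilities are inflated and high probabilities are deflated.
for p in (0.01, 0.5, 0.99):
    print(p, round(tk_weight(p), 4), round(prelec_weight(p), 4))
```

Both functions map 0 to 0 and 1 to 1, lie above the diagonal for small p and below it for large p, which is the behavior the paragraph above describes.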

A few remarks are in order. 

Remark 1. (RL applications) The CPT-value, as defined in (1), has several applications in RL. In general,
for any problem setting, one can define the return for a given policy and then apply the CPT-functional to the
return. For instance, with a fixed policy, the r.v. X could be the total reward in a stochastic shortest path
problem or the infinite horizon cumulative reward in a discounted MDP or the long-run average reward in
an MDP (see Appendix B for one such application).

Figure 3: An example of a weight function.

Remark 2. (Generalization) It is easy to see that the CPT-value is a generalization of the traditional
expectation, as a choice of identity map for the weight and utility functions in (1) recovers the expectation of
X. It is also possible to get (1) to coincide with risk measures (e.g. VaR and CVaR) by appropriate choice
of weight functions.

Remark 3. (Sensitivity) Traditional EU-based approaches are sensitive to modeling errors as illustrated in 
the following example: Suppose stock A gains $10000 w.p. 0.001 and loses nothing w.p. 0.999, while stock
B surely gains $11. With the classic value function objective, it is optimal to invest in stock B as it returns
$11, while A returns $10 in expectation (assuming the utility function to be the identity map). Now, if the gain
probability for stock A was 0.002, then it is no longer optimal to invest in stock B and investing in stock A 
is optimal. Notice that a very slight change in the underlying probabilities resulted in a big difference in the 
investment strategy and a similar observation carries over to a multi-stage scenario (see the house buying 
example in the numerical experiments section). 

Using CPT makes sense because it inflates low probabilities and thus can account for modeling errors, 
especially considering that model information is unavailable in practice. Note also that in MDPs with 
expected utility objective, there exists a deterministic policy that is optimal. However, with CPT-value 
objective, the optimal policy is not necessarily deterministic - See also the organ transplant example on pp. 
75-81 of Lin (2013). 
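
The arithmetic behind Remark 3 can be checked with a short sketch. It assumes the identity utility and, purely for illustration, the Tversky-Kahneman weight function with exponent 0.61 (our choice of parameter). For a prospect that pays x w.p. p and nothing otherwise, the first integral in (1) reduces to w(p) times x, which is what `cpt_value_single_gain` computes; all function names are hypothetical.

```python
def tk_weight(p, eta=0.61):
    """Tversky-Kahneman weight function (eta = 0.61 is an illustrative value)."""
    num = p ** eta
    return num / (num + (1.0 - p) ** eta) ** (1.0 / eta)

def expected_value(gain, prob):
    """Classic expected value of 'gain w.p. prob, nothing otherwise'."""
    return gain * prob

def cpt_value_single_gain(gain, prob, weight=tk_weight):
    """CPT-value of the same prospect under identity utility: w(prob) * gain."""
    return gain * weight(prob)

SURE_GAIN_B = 11.0  # stock B; w(1) = 1, so its CPT-value is also 11
for prob in (0.001, 0.002):
    ev = expected_value(10000.0, prob)
    cpt = cpt_value_single_gain(10000.0, prob)
    print(prob,
          "EV prefers", "A" if ev > SURE_GAIN_B else "B",
          "| CPT prefers", "A" if cpt > SURE_GAIN_B else "B")
```

Under these assumptions the expected-value ranking flips when the gain probability moves from 0.001 to 0.002, whereas the CPT ranking does not, because the small probability is inflated in both cases.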

3 CPT-value estimation 

Before diving into the details of CPT-value estimation, let us discuss the conditions necessary for the CPT-value
to be well-defined. Observe that the first integral in (1), i.e., \(\int_0^{+\infty} w^+(P(u^+(X) > z))\,dz\), may diverge
even if the first moment of random variable \(u^+(X)\) is finite. For example, suppose U has the tail distribution
function \(P(U > z) = \frac{1}{z^2}, z \in [1, +\infty)\), and \(w^+(z)\) takes the form \(w(z) = z^{1/3}\). Then, the first integral in
(1), i.e., \(\int_1^{+\infty} \frac{1}{z^{2/3}}\,dz\), does not even exist. A similar argument applies to the second integral in (1) as well.

To overcome the above integrability issues, we make different assumptions on the weight and/or utility
functions. In particular, we assume that the weight functions \(w^+, w^-\) are either (i) Lipschitz continuous,
or (ii) Hölder continuous, or (iii) locally Lipschitz. We devise a scheme for estimating (1) given only
samples from X and show that, under each of the aforementioned assumptions, our estimator (presented
next) converges almost surely. We also provide sample complexity bounds assuming that the utility functions
are bounded.


3.1 Estimation scheme for Hölder continuous weights

Recall the Hölder continuity property in Definition 1:

Definition 1. (Hölder continuity) If \(0 < \alpha \le 1\), a function \(f \in C([a,b])\) is said to satisfy a Hölder
condition of order \(\alpha\) (or to be Hölder continuous of order \(\alpha\)) if \(\exists H > 0\), s.t.

\[\sup_{x \neq y} \frac{|f(x) - f(y)|}{|x - y|^{\alpha}} \le H.\]


In order to ensure integrability of the CPT-value (1), we make the following assumption: 

Assumption (A1). The weight functions \(w^+, w^-\) are Hölder continuous with common order \(\alpha\). Further,
\(\exists \gamma \le \alpha\) s.t. \(\int_0^{+\infty} P^{\gamma}(u^+(X) > z)\,dz < +\infty\) and \(\int_0^{+\infty} P^{\gamma}(u^-(X) > z)\,dz < +\infty\).

The above assumption ensures that the CPT-value as defined by (1) is finite; see Proposition 5 in
Appendix C.1 for a formal proof.


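Approximating CPT-value using quantiles: Let \(\xi^+_{\alpha}\) denote the \(\alpha\)th quantile of the r.v. \(u^+(X)\). Then, it
can be seen that (see Proposition 6 in Appendix C.1)

\[\lim_{n \to \infty} \sum_{i=1}^{n-1} \xi^+_{i/n} \left( w^+\!\left(\frac{n-i}{n}\right) - w^+\!\left(\frac{n-i-1}{n}\right) \right) = \int_0^{+\infty} w^+\big(P(u^+(X) > z)\big)\,dz. \qquad (2)\]

A similar claim holds with \(u^-(X)\), \(\xi^-_{\alpha}\), \(w^-\) in place of \(u^+(X)\), \(\xi^+_{\alpha}\), \(w^+\), respectively. Here \(\xi^-_{\alpha}\) denotes the
\(\alpha\)th quantile of \(u^-(X)\).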

However, we do not know the distribution of u + (X) or u~{X) and hence, we next present a procedure 
that uses order statistics for estimating quantiles and this in turn assists estimation of the CPT-value along 
the lines of (2). The estimation scheme is presented in Algorithm 1. 


Algorithm 1 CPT-value estimation for Hölder continuous weights

Simulate n i.i.d. samples from the distribution of X.

Order the samples and label them as follows: \(X_{[1]}, X_{[2]}, \ldots, X_{[n]}\). Note that \(u^+(X_{[1]}), \ldots, u^+(X_{[n]})\) are
also in ascending order.

Denote the statistic

\[\bar{C}_n^+ := \sum_{i=1}^{n-1} u^+(X_{[i]}) \left( w^+\!\left(\frac{n-i}{n}\right) - w^+\!\left(\frac{n-i-1}{n}\right) \right).\]

Apply \(u^-\) on the sequence \(\{X_{[1]}, X_{[2]}, \ldots, X_{[n]}\}\); notice that \(u^-(X_{[i]})\) is in descending order since \(u^-\) is
a decreasing function.

Denote the statistic

\[\bar{C}_n^- := \sum_{i=1}^{n-1} u^-(X_{[i]}) \left( w^-\!\left(\frac{i}{n}\right) - w^-\!\left(\frac{i-1}{n}\right) \right).\]

Return \(\bar{C}_n = \bar{C}_n^+ - \bar{C}_n^-\).
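
A minimal Python sketch of Algorithm 1 follows. It assumes the two statistics are read as sums over i = 1, ..., n-1 of u+(X_[i]) (w+((n-i)/n) - w+((n-i-1)/n)) and u-(X_[i]) (w-(i/n) - w-((i-1)/n)), and that the estimate is their difference. The function name `cpt_estimate` and the example utilities (positive and negative parts of x) are illustrative assumptions.

```python
def cpt_estimate(samples, u_plus, u_minus, w_plus, w_minus):
    """Quantile-based CPT-value estimate from i.i.d. samples (sketch of Algorithm 1)."""
    xs = sorted(samples)              # order statistics X_[1] <= ... <= X_[n]
    n = len(xs)
    c_plus = 0.0
    c_minus = 0.0
    for i in range(1, n):             # i = 1, ..., n-1
        x = xs[i - 1]                 # X_[i] (lists are 0-based)
        c_plus += u_plus(x) * (w_plus((n - i) / n) - w_plus((n - i - 1) / n))
        c_minus += u_minus(x) * (w_minus(i / n) - w_minus((i - 1) / n))
    return c_plus - c_minus

# Hypothetical utilities: gains are the positive part, losses the negative part.
def gain_utility(x):
    return max(x, 0.0)

def loss_utility(x):
    return max(-x, 0.0)

def identity(p):
    return p

print(cpt_estimate([-2.0, -1.0, 1.0, 3.0], gain_utility, loss_utility, identity, identity))
```

Any weight function with w(0) = 0 and w(1) = 1 can be passed in place of `identity`; sorting dominates the cost, so one estimate takes O(n log n) time.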
Main results

Assumption (A2). The utility functions \(u^+(X)\) and \(u^-(X)\) are continuous and strictly increasing.

Assumption (A2’). In addition to (A2), the utility functions \(u^+(X)\) and \(u^-(X)\) are bounded above by
\(M < \infty\).

For the sample complexity results below, we require (A2’), while (A2) is sufficient to prove asymptotic
convergence.

Proposition 1. (Asymptotic convergence.) Assume (A1) and also that \(F^+(\cdot), F^-(\cdot)\), the distribution functions
of \(u^+(X)\) and \(u^-(X)\), are Lipschitz continuous with constants \(L^+\) and \(L^-\), respectively, on the intervals
\((0, +\infty)\) and \((-\infty, 0)\). Then, we have that

\[\bar{C}_n \to C(X) \text{ a.s. as } n \to \infty, \qquad (3)\]

where \(\bar{C}_n\) is as defined in Algorithm 1 and C(X) as in (1).

Proof. See Appendix C.1. □

While the above result establishes that \(\bar{C}_n\) is an unbiased estimator in the asymptotic sense, it is important
to know the rate at which the estimate \(\bar{C}_n\) converges to the CPT-value C(X). The following sample
complexity result shows that \(O\!\left(\frac{1}{\epsilon^{2/\alpha}}\right)\) samples are required to be \(\epsilon\)-close to the CPT-value with
high probability.

Proposition 2. (Sample complexity.) Assume (A1) and (A2’). Then, \(\forall \epsilon > 0, \delta > 0\), we have

\[P\big(|\bar{C}_n - C(X)| \le \epsilon\big) > 1 - \delta, \quad \forall n \ge \ln\!\left(\frac{1}{\delta}\right) \cdot \frac{4H^2M^2}{\epsilon^{2/\alpha}}.\]

Proof. See Appendix C.1. □
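
To get a feel for how the Hölder exponent affects the sample requirement, the sketch below evaluates a bound of the form ln(1/δ) · 4H²M² / ε^(2/α). This is our reading of the statement of Proposition 2; the constant 4H²M² in particular should be treated as an assumption, and the function name `samples_needed` is hypothetical.

```python
import math

def samples_needed(eps, delta, H, M, alpha):
    """Smallest integer n with n >= ln(1/delta) * 4 H^2 M^2 / eps^(2/alpha)."""
    bound = math.log(1.0 / delta) * 4.0 * H ** 2 * M ** 2 / eps ** (2.0 / alpha)
    return math.ceil(bound)

# alpha = 1 corresponds to the Lipschitz case (canonical 1/eps^2 rate);
# smaller alpha makes the requirement grow as 1/eps^(2/alpha).
for alpha in (1.0, 0.75, 0.5):
    print(alpha, samples_needed(eps=0.1, delta=0.05, H=1.0, M=1.0, alpha=alpha))
```

The comparison across values of α is the point of the example; the absolute numbers depend on the constants H and M, which are set to 1 here only for illustration.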

3.1.1 Results for Lipschitz continuous weights 

In the previous section, it was shown that Hölder continuous weights incur a sample complexity of order
\(O\!\left(\frac{1}{\epsilon^{2/\alpha}}\right)\), which is higher than the canonical Monte Carlo rate of \(O\!\left(\frac{1}{\epsilon^2}\right)\). In this section, we establish
that one can achieve the canonical Monte Carlo rate if we consider Lipschitz continuous weights, i.e., the
following assumption in place of (A1):

Assumption (A1’). The weight functions \(w^+, w^-\) are Lipschitz with common constant L, and \(u^+(X)\)
and \(u^-(X)\) both have bounded first moments.

Setting \(\alpha = 1\), one can obtain special cases of the claims regarding asymptotic convergence and sample
complexity of Propositions 1-2. However, these results hold under a restrictive Lipschitz assumption on the
distribution functions of \(u^+(X)\) and \(u^-(X)\). Using a different proof technique that employs the dominated
convergence theorem and DKW inequalities, one can obtain results similar to Propositions 1-2 with (A1’)
and (A2) only. The following claim makes this precise.

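Proposition 3. Assume (A1’) and (A2). Then, we have that

\[\bar{C}_n \to C(X) \text{ a.s. as } n \to \infty.\]

In addition, if we assume (A2’), we have \(\forall \epsilon > 0, \delta > 0\),

\[P\big(|\bar{C}_n - C(X)| \le \epsilon\big) > 1 - \delta, \quad \forall n \ge \ln\!\left(\frac{1}{\delta}\right) \cdot \frac{4L^2M^2}{\epsilon^2}.\]

Proof. See Appendix C.2. □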
3.2 Estimation scheme for locally Lipschitz weights and discrete X

Here we assume that the r.v. X is discrete valued. Let \(p_i, i = 1, \ldots, K\), denote the probability of incurring
a gain/loss \(x_i, i = 1, \ldots, K\), where \(x_1 \le \ldots \le x_l \le 0 \le x_{l+1} \le \ldots \le x_K\), and let

\[F_k = \sum_{i=1}^{k} p_i \text{ if } k \le l \quad \text{and} \quad \sum_{i=k}^{K} p_i \text{ if } k > l. \qquad (4)\]

Then, the CPT-value is defined as

\[C(X) = u^-(x_1)\,w^-(p_1) + \sum_{i=2}^{l} u^-(x_i)\big(w^-(F_i) - w^-(F_{i-1})\big) + \sum_{i=l+1}^{K-1} u^+(x_i)\big(w^+(F_i) - w^+(F_{i+1})\big) + u^+(x_K)\,w^+(p_K),\]

where \(u^+, u^-\) are utility functions and \(w^+, w^-\) are weight functions corresponding to gains and losses,
respectively. The utility functions \(u^+\) and \(u^-\) are non-decreasing, while the weight functions are continuous,
non-decreasing and have the range [0,1] with \(w^+(0) = w^-(0) = 0\) and \(w^+(1) = w^-(1) = 1\).

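Estimation scheme. Let \(\hat{p}_k = \frac{1}{n} \sum_{i=1}^{n} I_{\{U_i = x_k\}}\) and

\[\hat{F}_k = \sum_{i=1}^{k} \hat{p}_i \text{ if } k \le l \quad \text{and} \quad \sum_{i=k}^{K} \hat{p}_i \text{ if } k > l. \qquad (5)\]

Then, we estimate C(X) as follows:

\[\bar{C}_n = u^-(x_1)\,w^-(\hat{p}_1) + \sum_{i=2}^{l} u^-(x_i)\big(w^-(\hat{F}_i) - w^-(\hat{F}_{i-1})\big) + \sum_{i=l+1}^{K-1} u^+(x_i)\big(w^+(\hat{F}_i) - w^+(\hat{F}_{i+1})\big) + u^+(x_K)\,w^+(\hat{p}_K). \qquad (6)\]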
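
The discrete scheme can be sketched in a few lines of Python. The code below implements the discrete CPT-value formula as we read it, with loss outcomes using cumulative probabilities from the left and gain outcomes using tail probabilities from the right, and then plugs in empirical frequencies. Two assumptions are ours: outcomes equal to zero are grouped with the losses, and, because the loss terms are added in the formula as written, `u_minus` is expected to return signed values. The names `discrete_cpt` and `discrete_cpt_estimate` are hypothetical.

```python
from collections import Counter

def discrete_cpt(xs, ps, u_plus, u_minus, w_plus, w_minus):
    """CPT-value of a discrete r.v. with sorted outcomes xs and probabilities ps."""
    K = len(xs)
    l = sum(1 for x in xs if x <= 0)          # number of non-positive outcomes
    F = [0.0] * (K + 2)                       # F[0] = F[K+1] = 0 as boundary values
    for k in range(1, K + 1):
        mass = sum(ps[:k]) if k <= l else sum(ps[k - 1:])
        F[k] = min(1.0, mass)                 # guard against rounding above 1
    total = 0.0
    for i in range(1, l + 1):                 # loss side; i = 1 gives u-(x_1) w-(p_1)
        total += u_minus(xs[i - 1]) * (w_minus(F[i]) - w_minus(F[i - 1]))
    for i in range(l + 1, K + 1):             # gain side; i = K gives u+(x_K) w+(p_K)
        total += u_plus(xs[i - 1]) * (w_plus(F[i]) - w_plus(F[i + 1]))
    return total

def discrete_cpt_estimate(samples, xs, u_plus, u_minus, w_plus, w_minus):
    """Plug-in estimate: replace each p_k by its empirical frequency."""
    counts = Counter(samples)
    n = len(samples)
    p_hat = [counts[x] / n for x in xs]
    return discrete_cpt(xs, p_hat, u_plus, u_minus, w_plus, w_minus)

def identity(v):
    return v

xs = [-2, -1, 1, 3]
print(discrete_cpt(xs, [0.1, 0.2, 0.3, 0.4], identity, identity, identity, identity))
```

With identity utilities and weights the value coincides with the ordinary expectation, which gives a simple sanity check before substituting nonlinear utilities and weights.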

Assumption (A3). The weight functions \(w^+\) and \(w^-\) are locally Lipschitz continuous, i.e., for any
x, there exist \(L_x < \infty\) and \(\delta > 0\), such that

\[|w^+(x) - w^+(y)| \le L_x |x - y|, \text{ for all } y \in (x - \delta, x + \delta).\]

The main result for discrete-valued X is given below. 

Proposition 4. Assume (A3). Let \(L = \max\{L_k, k = 2 \ldots K\}\), where \(L_k\) is the local Lipschitz constant
of function \(w^-(x)\) at points \(F_k\), where \(k = 1, \ldots, l\), and of function \(w^+(x)\) at points \(F_k\), where \(k = l + 1, \ldots, K\). Let
\(A = \max\big(\{u^-(x_k), k = 1 \ldots l\} \cup \{u^+(x_k), k = l + 1 \ldots K\}\big)\) and \(\delta = \min\{\delta_k\}\), where \(\delta_k\) is half the length
of the interval centered at point \(F_k\) where the locally Lipschitz property with constant \(L_k\) holds. For any
\(\epsilon, \rho > 0\), we have

ln( —) 

P(|C n -CpO| <e) > 1 —p,Vn> (7) 


where \(M = \min(\delta^2, \epsilon^2/(KLA)^2)\).



In comparison to Propositions 2 and 3, observe that the sample complexity for discrete X scales with
the local Lipschitz constant L, which can be much smaller than the global Lipschitz constant of the weight
functions; indeed, the weight functions may not be Lipschitz globally.

Proof. See Section C.3. □ 

4 Gradient-based algorithm for CPT optimization (CPT-SPSA) 

Optimization objective: Suppose the r.v. X in (1) is a function of a d-dimensional parameter \(\theta\). The goal
then is to solve the following problem:

\[\text{Find } \theta^* = \arg\max_{\theta \in \Theta} C(X^{\theta}), \qquad (8)\]

where \(\Theta\) is a compact and convex subset of \(\mathbb{R}^d\). As mentioned earlier, the above problem encompasses
policy optimization in an MDP that can be discounted or average or episodic and/or partially observed. The
difference here is that we apply the CPT-functional to the return of a policy, while traditional approaches
consider the expected return.


4.1 Gradient estimation 


Given that we operate in a learning setting and only have biased estimates of the CPT-value from Algorithm
1, we require a simulation scheme to estimate \(\nabla C(X^{\theta})\). Simultaneous perturbation methods are a general
class of stochastic gradient schemes that optimize a function given only noisy sample values; see Bhatnagar
et al. (2013) for a textbook introduction. SPSA is a well-known scheme that estimates the gradient using two
sample values. In our context, at any iteration n of CPT-SPSA-G, with parameter \(\theta_n\), the gradient \(\nabla C(X^{\theta_n})\)
is estimated as follows: For any \(i = 1, \ldots, d\),

\[\widehat{\nabla}_i C(X^{\theta_n}) = \frac{\bar{C}_n^{\theta_n + \delta_n \Delta_n} - \bar{C}_n^{\theta_n - \delta_n \Delta_n}}{2 \delta_n \Delta_n^i}, \qquad (9)\]

where \(\delta_n\) is a positive scalar that satisfies (A3) below, \(\Delta_n = (\Delta_n^1, \ldots, \Delta_n^d)^T\), where \(\{\Delta_n^i, i = 1, \ldots, d\}\),
\(n = 1, 2, \ldots\) are i.i.d. Rademacher, independent of \(\theta_0, \ldots, \theta_n\), and \(\bar{C}_n^{\theta_n + \delta_n \Delta_n}\) (resp. \(\bar{C}_n^{\theta_n - \delta_n \Delta_n}\)) denotes
the CPT-value estimate that uses \(m_n\) samples of the r.v. \(X^{\theta_n + \delta_n \Delta_n}\) (resp. \(X^{\theta_n - \delta_n \Delta_n}\)). The (asymptotic)
unbiasedness of the gradient estimate is proven in Lemma 5.
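
The two-point estimate can be sketched as follows. The callable `estimate` stands in for the CPT-value estimate at a given parameter (for instance, Algorithm 1 applied to freshly simulated samples); its name and signature, like `spsa_gradient` itself, are illustrative assumptions.

```python
import random

def spsa_gradient(estimate, theta, delta, rng):
    """Two-point SPSA gradient estimate with Rademacher perturbations."""
    perturb = [rng.choice((-1.0, 1.0)) for _ in theta]      # Delta_n
    theta_plus = [t + delta * p for t, p in zip(theta, perturb)]
    theta_minus = [t - delta * p for t, p in zip(theta, perturb)]
    diff = estimate(theta_plus) - estimate(theta_minus)     # only two evaluations
    # The i-th coordinate divides the same difference by 2 * delta * Delta_n^i.
    return [diff / (2.0 * delta * p) for p in perturb]

# Toy check on a noise-free concave quadratic (hypothetical objective).
def toy_objective(theta):
    return -sum((t - 1.0) ** 2 for t in theta)

print(spsa_gradient(toy_objective, [3.0], 0.1, random.Random(0)))
```

Regardless of the dimension d, each call needs only two evaluations of `estimate`, which is the main appeal of simultaneous perturbation over coordinate-wise finite differences.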


4.2 Update rule 

We incrementally update the parameter \(\theta\) in the ascent direction as follows: For \(i = 1, \ldots, d\),

\[\theta_{n+1}^i = \Gamma_i\big(\theta_n^i + \gamma_n \widehat{\nabla}_i C(X^{\theta_n})\big), \qquad (10)\]

where \(\gamma_n\) is a step-size chosen to satisfy (A3) below and \(\Gamma = (\Gamma_1, \ldots, \Gamma_d)\) is an operator that ensures that
the update (10) stays bounded within a compact and convex set \(\Theta\). Algorithm 2 presents the pseudocode.
Algorithm 2 Structure of CPT-SPSA-G algorithm.

Input: initial parameter \(\theta_0 \in \Theta\), where \(\Theta\) is a compact and convex subset of \(\mathbb{R}^d\), perturbation constants
\(\delta_n > 0\), sample sizes \(\{m_n\}\), step-sizes \(\{\gamma_n\}\), operator \(\Gamma : \mathbb{R}^d \to \Theta\).
for n = 0, 1, 2, ... do
    Generate \(\{\Delta_n^i, i = 1, \ldots, d\}\) using Rademacher distribution, independent of \(\{\Delta_m, m = 0, 1, \ldots, n-1\}\).
    CPT-value Estimation (Trajectory 1)
        Simulate \(m_n\) samples using \((\theta_n + \delta_n \Delta_n)\).
        Obtain CPT-value estimate \(\bar{C}_n^{\theta_n + \delta_n \Delta_n}\).
    CPT-value Estimation (Trajectory 2)
        Simulate \(m_n\) samples using \((\theta_n - \delta_n \Delta_n)\).
        Obtain CPT-value estimate \(\bar{C}_n^{\theta_n - \delta_n \Delta_n}\).
    Gradient Ascent
        Update \(\theta_n\) using (10).
end for
Return \(\theta_n\).
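
The overall structure of Algorithm 2 can be sketched as a short loop. Several things here are assumptions of ours rather than part of the algorithm as stated: the set Θ is taken to be a box so that the projection is a coordinate-wise clamp, the schedules are passed in as callables of the iteration index, and `estimate(theta, m, rng)` is a placeholder for "simulate m samples under theta and return the CPT-value estimate".

```python
import random

def cpt_spsa(estimate, theta0, n_iters, gamma, delta, m, lower, upper, seed=0):
    """Projected SPSA ascent; gamma, delta and m map the iteration index n to a value."""
    rng = random.Random(seed)
    theta = list(theta0)
    for n in range(n_iters):
        perturb = [rng.choice((-1.0, 1.0)) for _ in theta]     # Rademacher Delta_n
        d_n = delta(n)
        plus = [t + d_n * p for t, p in zip(theta, perturb)]
        minus = [t - d_n * p for t, p in zip(theta, perturb)]
        # Two CPT-value estimates, each based on m(n) simulated samples.
        diff = estimate(plus, m(n), rng) - estimate(minus, m(n), rng)
        # Gradient ascent step followed by projection onto the box [lower, upper]^d.
        theta = [min(upper, max(lower, t + gamma(n) * diff / (2.0 * d_n * p)))
                 for t, p in zip(theta, perturb)]
    return theta

# Toy stand-in for the CPT-value estimate: a noise-free concave objective.
def toy_estimate(theta, num_samples, rng):
    return -sum((t - 2.0) ** 2 for t in theta)

result = cpt_spsa(toy_estimate, [1.0], 500,
                  gamma=lambda n: 1.0 / (n + 50),
                  delta=lambda n: 1.9 / (n + 1) ** 0.101,   # index shifted to avoid 1/0
                  m=lambda n: 10 * (n + 1),
                  lower=0.1, upper=10.0)
print(result)
```

In an actual application `toy_estimate` would be replaced by a simulator call followed by the CPT-value estimation scheme, and the growth of `m` would be chosen to control the estimation bias.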


On the number of samples \(m_n\) per iteration: The CPT-value estimation scheme is biased, i.e., providing
samples with parameter \(\theta_n\) at instant n, we obtain its CPT-value estimate as \(C(X^{\theta_n}) + \epsilon_n^{\theta}\), with \(\epsilon_n^{\theta}\) denoting
the bias. The bias can be controlled by increasing the number of samples \(m_n\) in each iteration of CPT-SPSA
(see Algorithm 2). This is unlike classic simulation optimization settings where one only sees function
evaluations with zero mean noise and there is no question of deciding on \(m_n\) to control the bias as we have
in our setting.

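To motivate the choice for \(m_n\), we first rewrite the update rule (10) as follows:

\[\theta_{n+1}^i = \Gamma_i\left(\theta_n^i + \gamma_n \left( \frac{C(X^{\theta_n + \delta_n \Delta_n}) - C(X^{\theta_n - \delta_n \Delta_n})}{2 \delta_n \Delta_n^i} + \kappa_n \right)\right), \quad \text{where } \kappa_n = \frac{\epsilon_n^{\theta_n + \delta_n \Delta_n} - \epsilon_n^{\theta_n - \delta_n \Delta_n}}{2 \delta_n \Delta_n^i}.\]

Let \(\zeta_n = \sum_{l=0}^{n} \gamma_l \kappa_l\). Then, a critical requirement that allows us to ignore the bias term \(\zeta_n\) is the following
condition (see Lemma 1 in Chapter 2 of Borkar (2008)):

\[\sup_{l \ge 0} \left(\zeta_{n+l} - \zeta_n\right) \to 0 \text{ as } n \to \infty.\]

While Theorems 1-2 show that the bias \(\epsilon^{\theta}\) is bounded above, to establish convergence of the policy gradient
recursion (10), we increase the number of samples \(m_n\) so that the bias vanishes asymptotically. The
assumption below provides a condition on the increase rate of \(m_n\).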

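Assumption (A3). The step-sizes \(\gamma_n\) and the perturbation constants \(\delta_n\) are positive \(\forall n\) and satisfy

\[\gamma_n, \delta_n \to 0, \quad \frac{1}{m_n^{\alpha/2} \delta_n} \to 0, \quad \sum_n \gamma_n = \infty \quad \text{and} \quad \sum_n \frac{\gamma_n^2}{\delta_n^2} < \infty.\]

While the conditions on \(\gamma_n\) and \(\delta_n\) are standard for SPSA-based algorithms, the condition on \(m_n\) is motivated
by the earlier discussion. A simple choice that satisfies the above conditions is \(\gamma_n = a_0/n\), \(m_n = m_0 n^{\nu}\)
and \(\delta_n = \delta_0/n^{\gamma}\), for some \(\nu, \gamma > 0\) with \(\gamma > \nu\alpha/2\).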
4.3 Convergence result

Theorem 1. Assume (A1)-(A3) and also that \(C(X^{\theta})\) is a continuously differentiable function of \(\theta\), for any
\(\theta \in \Theta\). 3 Consider the ordinary differential equation (ODE):

\[\dot{\theta}_t^i = \check{\Gamma}_i\big(-\nabla C(X^{\theta_t^i})\big), \text{ for } i = 1, \ldots, d,\]

where \(\check{\Gamma}_i(f(\theta)) := \lim_{\alpha \downarrow 0} \frac{\Gamma_i(\theta + \alpha f(\theta)) - \theta}{\alpha}\), for any continuous \(f(\cdot)\). Let \(\mathcal{K} = \{\theta \mid \check{\Gamma}_i(\nabla_i C(X^{\theta})) = 0, \forall i = 1, \ldots, d\}\).
Then, for \(\theta_n\) governed by (10), we have

\[\theta_n \to \mathcal{K} \text{ a.s. as } n \to \infty.\]

Proof. See Appendix D. □ 

See Appendix E for a second-order CPT-value optimization scheme based on SPSA. 


5 Simulation Experiments 

We consider a traffic signal control application where the aim is to improve the road user experience by an
adaptive traffic light control (TLC) algorithm. We apply the CPT-functional to the delay experienced by
road users, since CPT realistically captures the attitude of the road users towards delays. We then
optimize the CPT-value of the delay and contrast this approach with traditional expected delay optimizing
algorithms.

We consider a road network with N signalled lanes that are spread across junctions and M paths, where
each path connects (uniquely) two edge nodes, where the traffic is generated (see Figure 4(a)). At any
instant n, let \(q_n^i\) and \(t_n^i\) denote the queue length and elapsed time since the lane turned red, for any lane
\(i = 1, \ldots, N\). Let \(d_n^{i,j}\) denote the delay experienced by the jth road user on the ith path, for any \(i = 1, \ldots, M\) and
\(j = 1, \ldots, n_i\), where \(n_i\) denotes the number of road users on path i. We specify the various components
of the traffic control MDP in the following. The state \(s_n = (q_n^1, \ldots, q_n^N, t_n^1, \ldots, t_n^N, d_n^{1,1}, \ldots, d_n^{M, n_M})^T\) is
a vector of lane-wise queue lengths, elapsed times and path-wise delays. The actions are the feasible sign
configurations. Traffic lights that can be simultaneously switched to green form a sign configuration.

We consider three different notions of return as follows:

CPT: Let \(\mu^i\) be the proportion of road users along path i, for \(i = 1, \ldots, M\). Any road user along path
i will evaluate the delay he experiences in a manner that is captured well by CPT. Let \(X_i\) be the delay r.v.
for path i and let the corresponding CPT-value be \(C(X_i)\). With the objective of maximizing the experience
of road users across paths, the overall return to be optimized is given by

\[\mathrm{CPT}(X_1, \ldots, X_M) = \sum_{i=1}^{M} \mu^i C(X_i). \qquad (11)\]

EUT: Here we only use the utility functions \(u^+\) and \(u^-\) to handle gains and losses, but do not distort
probabilities. Thus, the EUT objective is defined as

\[\mathrm{EUT}(X_1, \ldots, X_M) = \sum_{i=1}^{M} \mu^i \big( E(u^+(X_i)) - E(u^-(X_i)) \big),\]

where \(E(u^+(X_i)) = \int_0^{+\infty} P(u^+(X_i) > z)\,dz\) and \(E(u^-(X_i)) = \int_0^{+\infty} P(u^-(X_i) > z)\,dz\), for \(i = 1, \ldots, M\).

AVG: This is similar to EUT, except that it neither distinguishes between gains and losses via utility functions
nor distorts probabilities using weights as in CPT. Thus, \(\mathrm{AVG}(X_1, \ldots, X_M) = \sum_{i=1}^{M} \mu^i E(X_i)\).

3 In a typical RL setting, it is sufficient to assume that the policy is continuously differentiable in \(\theta\).

Figure 4: Histogram of CPT-value of the average delay for three different algorithms (all based on SPSA):
AVG uses plain sample means (no utility/weights), EUT uses utilities but no weights and CPT uses both
utilities and weights. Note: larger values are better. Panels: (c) EUT-SPSA, (d) CPT-SPSA; horizontal axis: Bin.

An important recommendation of CPT is to employ a reference point to calculate gains and losses. In 
our setting, we use path-wise delays obtained from a pre-timed TLC (cf. the Fixed TLCs in Prashanth and 
Bhatnagar (2011)) as the reference point. In other words, if the delay of any algorithm (say CPT-SPSA) 
is less than that of pre-timed TLC, then the (positive) difference in delays is perceived as a gain and in the 
complementary case, the delay difference is perceived as a loss. The \(d_n^{i,j}\) in the state \(s_n\) are to be understood
as the delay difference to the pre-timed TLC.

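The underlying policy in all the algorithms that we implement follows a Boltzmann distribution and has
the form \(\pi_{\theta}(s,a) = \frac{e^{\theta^T \phi_{s,a}}}{\sum_{a' \in \mathcal{A}(s)} e^{\theta^T \phi_{s,a'}}}, \forall s \in \mathcal{S}, \forall a \in \mathcal{A}(s)\), where the features \(\phi(s,a)\) are chosen as in
Prashanth and Bhatnagar (2012).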

We implement the following TLC algorithms: 

CPT-SPSA: This is the first-order algorithm with SPSA-based gradient estimates, as described in Algorithm 2.
In particular, the estimation scheme in Algorithm 1 is invoked to estimate \(C(X_i)\) for each path
\(i = 1, \ldots, M\), with \(d_n^{i,j}, j = 1, \ldots, n_i\) as the samples.

EUT-SPSA: This is similar to CPT-SPSA, except that the weight functions are \(w^+(p) = w^-(p) = p\), for
\(p \in [0,1]\).

AVG-SPSA: This is similar to CPT-SPSA, except that the weight functions are \(w^+(p) = w^-(p) = p\), for
\(p \in [0,1]\), and, in line with the AVG objective above, no utility functions are applied.

For both CPT-SPSA and EUT-SPSA, we set the utility functions (see (1)) as follows: $u^+(x) = |x|^{\sigma}$ and $u^-(x) = \lambda |x|^{\sigma}$, with $\lambda = 2.25$ and $\sigma = 0.88$. For CPT-SPSA, we set the weights as follows:

$$w^+(p) = \frac{p^{\eta_1}}{\left(p^{\eta_1} + (1-p)^{\eta_1}\right)^{1/\eta_1}} \quad \text{and} \quad w^-(p) = \frac{p^{\eta_2}}{\left(p^{\eta_2} + (1-p)^{\eta_2}\right)^{1/\eta_2}},$$

with $\eta_1 = 0.61$ and $\eta_2 = 0.69$. These choices are based on median estimates given by Tversky and Kahneman (1992) and have been used earlier in a traffic application (see Gao et al. (2010)). For all the algorithms, we set $\delta_n = 1.9/n^{0.101}$ and $\gamma_n = 1/(n+50)$, motivated by standard guidelines - see Spall (2005). The initial point $\theta_0$ is the $d$-dimensional vector of ones and, $\forall i$, the operator $\Gamma_i$ projects $\theta_i$ onto the set $[0.1, 10.0]$.
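As a rough illustration of the utility and weight choices above, the following sketch evaluates them numerically. The function names are illustrative assumptions; only the constants (2.25, 0.88, 0.61, 0.69) come from the text.

```python
def u_plus(x, sigma=0.88):
    """Utility applied to gains: |x|^sigma."""
    return abs(x) ** sigma

def u_minus(x, lam=2.25, sigma=0.88):
    """Utility applied to losses: lambda * |x|^sigma."""
    return lam * abs(x) ** sigma

def tk_weight(p, eta):
    """Weight p^eta / (p^eta + (1 - p)^eta)^(1/eta) for p in [0, 1]."""
    p = min(max(p, 0.0), 1.0)   # guard against float round-off outside [0, 1]
    num = p ** eta
    return num / (num + (1.0 - p) ** eta) ** (1.0 / eta)

def w_plus(p):
    return tk_weight(p, 0.61)

def w_minus(p):
    return tk_weight(p, 0.69)
```

With these parameters, small probabilities are inflated (for example `w_plus(0.05)` exceeds 0.05) and large ones are deflated, which is the qualitative shape described in Appendix A.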

The experiments involve two phases. In the training phase, we run each algorithm for 200 iterations,
with each iteration involving two perturbed simulations, each of trajectory length 500. This is followed by
a test phase where we fix the policy for each algorithm and perform 100 independent simulations of the MDP (each
with a trajectory length of 1000). After each run in the test phase, the overall CPT-value (11)
is estimated.

Figures 4(b)-4(d) present the histogram of the CPT-values from the test phase for AVG-SPSA, EUT-SPSA
and CPT-SPSA, respectively. A similar exercise for pre-timed TLC resulted in a CPT-value of
-46.14. It is evident that each algorithm converges to a different policy. However, the CPT-value of the
resulting policies is highest in the case of CPT-SPSA, followed by EUT-SPSA and AVG-SPSA in that order.
Intuitively this is expected because AVG-SPSA uses neither utilities nor probability distortions, while EUT-SPSA
distinguishes between gains and losses using utilities while not using weights to distort probabilities.
The results in Figure 4 argue for specialized algorithms that incorporate CPT-based criteria, especially in the light
of previous findings which show that CPT matches human evaluation well, and given the need for algorithms that
serve human needs well.


6 Conclusions and Future Work 

CPT has been a very popular paradigm for modeling human decisions among psychologists/economists,
but has escaped the radar of the AI community. This work is the first step in incorporating CPT-based
criteria into an RL framework. However, both estimation and control of CPT-based value are challenging.
We proposed a quantile-based estimation scheme that converges at the optimal rate. Next, for the problem
of control, since CPT-value does not conform to any Bellman equation, we employed SPSA, a popular
simulation optimization scheme, and designed a first-order algorithm for optimizing the CPT-value. We
provided theoretical convergence guarantees for all the proposed algorithms and illustrated the usefulness
of CPT-based criteria in a traffic signal control application.


Appendix 

A Background on CPT 

For a random variable $X$, let $p_i$, $i = 1,\ldots,K$ denote the probability of incurring a gain/loss $x_i$, $i = 1,\ldots,K$. Given a utility function $u$ and a weighting function $w$, the prospect theory (PT) value is defined as
$C(X) = \sum_{i=1}^{K} u(x_i) w(p_i)$. The idea is to take a utility function that is S-shaped, so that it satisfies the
diminishing sensitivity property. If we take the weighting function $w$ to be the identity, then one recovers the
classic expected utility. A general weight function inflates low probabilities and deflates high probabilities
and this has been shown to be close to the way humans make decisions (see Kahneman and Tversky (1979),
Fennema and Wakker (1997) for a justification, in particular via empirical tests using human subjects). However,
PT is lacking in some theoretical aspects as it violates first-order stochastic dominance. Consider the
following example from Fennema and Wakker (1997): Suppose there are 20 prospects (outcomes) ranging
from -10 to 180, each with probability 0.05. If the weight function is such that $w(0.05) > 0.05$, then it
uniformly overweights all low-probability prospects and the resulting PT value is higher than the expected
value 85. This violates stochastic dominance, since a shift in the probability mass from bad outcomes did
not result in a better prospect.
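The arithmetic of this example can be checked with a small sketch. The identity utility and the particular overweighting value 0.06 are assumptions made purely for illustration; any weight with $w(0.05) > 0.05$ would show the same effect.

```python
# 20 equally likely outcomes: -10, 0, 10, ..., 180
outcomes = [-10 + 10 * i for i in range(20)]
p = 0.05

def pt_value(outcomes, prob, w, u=lambda x: x):
    """PT value sum_i u(x_i) * w(p_i) when every outcome has the same probability."""
    return sum(u(x) * w(prob) for x in outcomes)

# Identity weight recovers the expected value.
expected_value = pt_value(outcomes, p, w=lambda q: q)

# Hypothetical weight that inflates the probability 0.05 to 0.06.
overweighting = lambda q: 0.06 if q == 0.05 else q
pt_overweighted = pt_value(outcomes, p, w=overweighting)
```

Because every outcome is overweighted by the same factor, the PT value is simply the sum of outcomes scaled by $w(0.05)$, which exceeds the expected value whenever $w(0.05) > 0.05$.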

Cumulative prospect theory (CPT) (Tversky and Kahneman, 1992) uses a similar measure as PT, except
that the weights are a function of cumulative probabilities. First, separate the gains and losses as $x_1 \le \ldots \le x_l \le 0 \le x_{l+1} \le \ldots \le x_K$. Then, the CPT-value is defined as

$$C(X) = u^-(x_1) \cdot w^-(p_1) + \sum_{i=2}^{l} u^-(x_i)\left(w^-\Big(\sum_{j=1}^{i} p_j\Big) - w^-\Big(\sum_{j=1}^{i-1} p_j\Big)\right)$$
$$\qquad + \sum_{i=l+1}^{K-1} u^+(x_i)\left(w^+\Big(\sum_{j=i}^{K} p_j\Big) - w^+\Big(\sum_{j=i+1}^{K} p_j\Big)\right) + u^+(x_K) \cdot w^+(p_K),$$

where $u^+, u^-$ are utility functions and $w^+, w^-$ are weight functions corresponding to gains and losses,
respectively. The utility functions $u^+$ and $u^-$ are non-decreasing, while the weight functions are continuous,
non-decreasing and have the range $[0,1]$ with $w^+(0) = w^-(0) = 0$ and $w^+(1) = w^-(1) = 1$. Unlike PT,
the CPT-value does not violate stochastic dominance. In the aforementioned example, increasing $w^-(0.05)$
and $w^+(0.05)$ does not impact outcomes other than those on the extreme, i.e., -10 and 180, respectively. For
instance, the weight for outcome 100 would be $w^+(0.45) - w^+(0.40)$. Thus, CPT formalizes the intuitive
notion that humans are sensitive to extreme outcomes and relatively insensitive to intermediate ones.
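The decision weights implied by the CPT definition can be sketched as follows. The function names are hypothetical, the weight form and the exponents 0.61 and 0.69 are borrowed from the experimental section purely as an example, and treating the outcome 0 as a gain is an assumption of this sketch.

```python
def tk_weight(p, eta):
    """Example weight p^eta / (p^eta + (1 - p)^eta)^(1/eta), clamped to [0, 1]."""
    p = min(max(p, 0.0), 1.0)
    num = p ** eta
    return num / (num + (1.0 - p) ** eta) ** (1.0 / eta)

def cpt_decision_weights(probs, num_losses, w_minus, w_plus):
    """Decision weights for outcomes sorted in ascending order.

    The first num_losses outcomes are losses and use cumulative probabilities
    from the left; the remaining outcomes are gains and use cumulative
    probabilities from the right.
    """
    weights = []
    for i in range(len(probs)):
        if i < num_losses:
            weights.append(w_minus(sum(probs[: i + 1])) - w_minus(sum(probs[:i])))
        else:
            weights.append(w_plus(sum(probs[i:])) - w_plus(sum(probs[i + 1:])))
    return weights

def cpt_value(utilities, weights):
    """Sum of utility times decision weight, mirroring the displayed formula."""
    return sum(u * w for u, w in zip(utilities, weights))

outcomes = [-10 + 10 * i for i in range(20)]
probs = [0.05] * 20
num_losses = 1   # only -10 is treated as a loss in this sketch
weights = cpt_decision_weights(
    probs, num_losses, lambda p: tk_weight(p, 0.69), lambda p: tk_weight(p, 0.61)
)
idx_100 = outcomes.index(100)
identity_weights = cpt_decision_weights(probs, num_losses, lambda p: p, lambda p: p)
```

In this example `weights[idx_100]` coincides with $w^+(0.45) - w^+(0.40)$, and with identity weight functions the decision weights reduce to the original probabilities.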

Allais paradox 

Suppose we have the following two traffic light switching policies: 

[Policy 1] A throughput (number of vehicles that reach destination per unit time) of 1000 w.p. 1. Let 
this be denoted by (1000,1). 

[Policy 2] (10000,0.1; 1000, 0.89; 100,0.01) i.e., throughputs 10000, 1000 and 100 with respective 
probabilities 0.1, 0.89 and 0.01. 

Humans usually choose Policy 1 over Policy 2. On the other hand, consider the following two policies: 
[Policy 3] (100,0.89; 1000, 0.11)

[Policy 4] (100,0.9; 10000, 0.1) 

Humans usually choose Policy 4 over Policy 3. 

We can now argue against using expected utility (EU) as an objective as follows: Let u be the utility 
function in EU. 


Policy 1 is preferred over Policy 2

$$\Rightarrow u(1000) > 0.1\,u(10000) + 0.89\,u(1000) + 0.01\,u(100)$$
$$\Rightarrow 0.11\,u(1000) > 0.1\,u(10000) + 0.01\,u(100) \qquad (12)$$

Policy 4 is preferred over Policy 3

$$\Rightarrow 0.89\,u(100) + 0.11\,u(1000) < 0.9\,u(100) + 0.1\,u(10000)$$
$$\Rightarrow 0.11\,u(1000) < 0.1\,u(10000) + 0.01\,u(100) \qquad (13)$$

And we have a contradiction from (12) and (13).
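The contradiction can also be seen numerically: for any utility $u$, the expected-utility gap between Policies 1 and 2 equals the gap between Policies 3 and 4, so no expected-utility maximizer can prefer Policy 1 and Policy 4 simultaneously. The sketch below checks this for a few utility functions chosen here only as examples.

```python
import math

# Each policy is a list of (throughput, probability) pairs from the text.
policies = {
    1: [(1000, 1.0)],
    2: [(10000, 0.1), (1000, 0.89), (100, 0.01)],
    3: [(100, 0.89), (1000, 0.11)],
    4: [(100, 0.9), (10000, 0.1)],
}

def expected_utility(policy, u):
    return sum(p * u(x) for x, p in policy)

def preference_gaps(u):
    """Return (EU(1) - EU(2), EU(3) - EU(4)) for utility u."""
    gap_12 = expected_utility(policies[1], u) - expected_utility(policies[2], u)
    gap_34 = expected_utility(policies[3], u) - expected_utility(policies[4], u)
    return gap_12, gap_34

# Example utilities (assumptions for illustration only).
example_utilities = [lambda x: float(x), math.log, math.sqrt]
gaps = [preference_gaps(u) for u in example_utilities]
```

Since the two gaps always share the same sign, preferring Policy 1 over 2 forces preferring Policy 3 over 4 under expected utility.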


B CPT-value in a Stochastic Shortest Path Setting 

We consider a stochastic shortest path (SSP) problem with states $\mathcal{S} = \{0,\ldots,L\}$, where 0 is a special
reward-free absorbing state. A randomized policy $\pi$ is a function that maps any state $s \in \mathcal{S}$ onto a probability
distribution over the actions $\mathcal{A}(s)$ in state $s$. As is standard in policy gradient algorithms, we parameterize $\pi$
and assume it is continuously differentiable in its parameter $\theta \in \mathbb{R}^d$. An episode is a simulated sample path
using policy $\theta$ that starts in state $s^0 \in \mathcal{S}$, visits $\{s_1,\ldots,s_{\tau-1}\}$ before ending in the absorbing state 0, where
$\tau$ is the first passage time to state 0. Let $D^\theta(s^0)$ be a random variable (r.v.) that denotes the total reward from
an episode, defined by

$$D^\theta(s^0) = \sum_{m=0}^{\tau-1} r(s_m, a_m),$$

where the actions $a_m$ are chosen using policy $\theta$ and $r(s_m, a_m)$ is the single-stage reward in state $s_m \in \mathcal{S}$
when action $a_m \in \mathcal{A}(s_m)$ is chosen.

Instead of the traditional RL objective for an SSP of maximizing the expected value $E(D^\theta(s^0))$, we
adopt the CPT approach and aim to solve the following problem:

$$\max_{\theta \in \Theta} C(D^\theta(s^0)),$$

where $\Theta$ is the set of admissible policies that are proper (see footnote 4) and the CPT-value function $C(D^\theta(s^0))$ is defined
as

$$C(D^\theta(s^0)) = \int_0^{+\infty} w^+\big(P(u^+(D^\theta(s^0)) > z)\big)\,dz - \int_0^{+\infty} w^-\big(P(u^-(D^\theta(s^0)) > z)\big)\,dz. \qquad (14)$$

Footnote 4: A policy $\theta$ is proper if 0 is recurrent and all other states are transient for the Markov chain underlying $\theta$. It is standard to assume
that policies are proper in an SSP setting - cf. Bertsekas (2007).
C Proofs for CPT-value estimator

C.1 Hölder continuous weights

For proving Propositions 1 and 4, we require Hoeffding's inequality, which is given below.

Lemma 2. Let $Y_1, \ldots, Y_n$ be independent random variables satisfying $P(a \le Y_i \le b) = 1$, for each $i$, where
$a < b$. Then for $t > 0$,

$$P\left(\left|\sum_{i=1}^{n} Y_i - \sum_{i=1}^{n} E(Y_i)\right| \ge nt\right) \le 2\exp\{-2nt^2/(b-a)^2\}.$$

Proposition 5. Under (A1'), the CPT-value $C(X)$ as defined by (14) is finite.

Proof. Hölder continuity of $w^+$ together with the fact that $w^+(0) = 0$ imply that

$$\int_0^{+\infty} w^+(P(u^+(X) > z))\,dz \le H \int_0^{+\infty} P^{\alpha}(u^+(X) > z)\,dz \le H \int_0^{+\infty} P^{\gamma}(u^+(X) > z)\,dz < +\infty.$$

The second inequality is valid since $P(u^+(X) > z) \le 1$. The claim follows for the first integral in (14) and
the finiteness of the second integral in (14) can be argued in an analogous fashion. □

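Proposition 6. Assume (A1'). Let $\xi^+_{\frac{i}{n}}$ and $\xi^-_{\frac{i}{n}}$ denote the $\frac{i}{n}$th quantile of $u^+(X)$ and $u^-(X)$, respectively.
Then, we have

$$\lim_{n \to \infty} \sum_{i=1}^{n-1} \xi^+_{\frac{i}{n}} \left(w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right) = \int_0^{+\infty} w^+(P(u^+(X) > z))\,dz < +\infty,$$

$$\lim_{n \to \infty} \sum_{i=1}^{n-1} \xi^-_{\frac{i}{n}} \left(w^-\left(\tfrac{n-i}{n}\right) - w^-\left(\tfrac{n-i-1}{n}\right)\right) = \int_0^{+\infty} w^-(P(u^-(X) > z))\,dz < +\infty. \qquad (15)$$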


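Proof. We shall focus on proving the first part of equation (15). Consider the following linear combination
of simple functions:

$$\sum_{i=0}^{n-1} w^+\left(\tfrac{i}{n}\right) \cdot I_{\left[\xi^+_{\frac{n-i-1}{n}},\, \xi^+_{\frac{n-i}{n}}\right]}(t), \qquad (16)$$

which will converge almost everywhere to the function $w^+(P(u^+(X) > t))$ in the interval $[0, +\infty)$, and also
notice that

$$\sum_{i=0}^{n-1} w^+\left(\tfrac{i}{n}\right) \cdot I_{\left[\xi^+_{\frac{n-i-1}{n}},\, \xi^+_{\frac{n-i}{n}}\right]}(t) \le w^+(P(u^+(X) > t)), \quad \forall t \in [0, +\infty). \qquad (17)$$

The integral of (16) can be simplified as follows:

$$\int_0^{+\infty} \sum_{i=0}^{n-1} w^+\left(\tfrac{i}{n}\right) \cdot I_{\left[\xi^+_{\frac{n-i-1}{n}},\, \xi^+_{\frac{n-i}{n}}\right]}(t)\,dt = \sum_{i=0}^{n-1} w^+\left(\tfrac{i}{n}\right) \cdot \left(\xi^+_{\frac{n-i}{n}} - \xi^+_{\frac{n-i-1}{n}}\right) \qquad (18)$$

$$= \sum_{i=0}^{n-1} \xi^+_{\frac{i}{n}} \cdot \left(w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right). \qquad (19)$$

The Hölder continuity property assures the fact that $\lim_{n \to \infty} \left|w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right| = 0$, and the limit
in (15) holds through a typical application of the dominated convergence theorem. The second part of (15)
can be justified in a similar fashion. □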
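Proof of Proposition 1

Proof. Without loss of generality, assume that the Hölder constant $H$ is 1. We first prove that

$$C_n^+ \to C^+(X) \text{ a.s. as } n \to \infty.$$

Or equivalently, show that

$$\sum_{i=1}^{n-1} u^+(X_{[i]}) \left(w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right) \xrightarrow{n \to \infty} \int_0^{+\infty} w^+(P(U > t))\,dt, \text{ w.p. } 1. \qquad (20)$$

The main part of the proof is concentrated on finding an upper bound of the probability

$$P\left(\left|\sum_{i=1}^{n-1} u^+(X_{[i]}) \cdot \left(w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right) - \sum_{i=1}^{n-1} \xi^+_{\frac{i}{n}} \cdot \left(w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right)\right| > \epsilon\right),$$

for any given $\epsilon > 0$. Observe that

$$P\left(\left|\sum_{i=1}^{n-1} u^+(X_{[i]}) \cdot \left(w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right) - \sum_{i=1}^{n-1} \xi^+_{\frac{i}{n}} \cdot \left(w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right)\right| > \epsilon\right) \qquad (21)$$

$$\le P\left(\bigcup_{i=1}^{n-1} \left\{\left|u^+(X_{[i]}) \cdot \left(w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right) - \xi^+_{\frac{i}{n}} \cdot \left(w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right)\right| > \tfrac{\epsilon}{n}\right\}\right)$$

$$\le \sum_{i=1}^{n-1} P\left(\left|u^+(X_{[i]}) \cdot \left(w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right) - \xi^+_{\frac{i}{n}} \cdot \left(w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right)\right| > \tfrac{\epsilon}{n}\right) \qquad (22)$$

$$= \sum_{i=1}^{n-1} P\left(\left|\left(u^+(X_{[i]}) - \xi^+_{\frac{i}{n}}\right) \cdot \left(w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right)\right| > \tfrac{\epsilon}{n}\right)$$

$$\le \sum_{i=1}^{n-1} P\left(\left|\left(u^+(X_{[i]}) - \xi^+_{\frac{i}{n}}\right) \cdot \left(\tfrac{1}{n}\right)^{\alpha}\right| > \tfrac{\epsilon}{n}\right)$$

$$= \sum_{i=1}^{n-1} P\left(\left|u^+(X_{[i]}) - \xi^+_{\frac{i}{n}}\right| > \tfrac{\epsilon}{n^{1-\alpha}}\right). \qquad (23)$$

Now we find the upper bound of the probability of a single item in the sum above, i.e.,

$$P\left(\left|u^+(X_{[i]}) - \xi^+_{\frac{i}{n}}\right| > \tfrac{\epsilon}{n^{(1-\alpha)}}\right) = P\left(u^+(X_{[i]}) - \xi^+_{\frac{i}{n}} > \tfrac{\epsilon}{n^{(1-\alpha)}}\right) + P\left(u^+(X_{[i]}) - \xi^+_{\frac{i}{n}} < -\tfrac{\epsilon}{n^{(1-\alpha)}}\right).$$

We focus on the term $P\left(u^+(X_{[i]}) - \xi^+_{\frac{i}{n}} > \tfrac{\epsilon}{n^{(1-\alpha)}}\right)$. Let $W_t = I_{\left(u^+(X_t) > \xi^+_{\frac{i}{n}} + \frac{\epsilon}{n^{(1-\alpha)}}\right)}$, $t = 1,\ldots,n$. Using the
fact that the probability distribution function is non-decreasing, we obtain

$$P\left(u^+(X_{[i]}) - \xi^+_{\frac{i}{n}} > \tfrac{\epsilon}{n^{(1-\alpha)}}\right) = P\left(\sum_{t=1}^{n} W_t > n \cdot \left(1 - \tfrac{i}{n}\right)\right)$$

$$= P\left(\sum_{t=1}^{n} W_t - n \cdot \left[1 - F^+\left(\xi^+_{\frac{i}{n}} + \tfrac{\epsilon}{n^{(1-\alpha)}}\right)\right] > n \cdot \left[F^+\left(\xi^+_{\frac{i}{n}} + \tfrac{\epsilon}{n^{(1-\alpha)}}\right) - \tfrac{i}{n}\right]\right).$$

Using the fact that $E W_t = 1 - F^+\left(\xi^+_{\frac{i}{n}} + \tfrac{\epsilon}{n^{(1-\alpha)}}\right)$ in conjunction with Hoeffding's inequality, we obtain

$$P\left(\sum_{t=1}^{n} W_t - n \cdot \left[1 - F^+\left(\xi^+_{\frac{i}{n}} + \tfrac{\epsilon}{n^{(1-\alpha)}}\right)\right] > n \cdot \left[F^+\left(\xi^+_{\frac{i}{n}} + \tfrac{\epsilon}{n^{(1-\alpha)}}\right) - \tfrac{i}{n}\right]\right) < e^{-2n\delta_t'}, \qquad (24)$$

where $\delta_t' = F^+\left(\xi^+_{\frac{i}{n}} + \tfrac{\epsilon}{n^{(1-\alpha)}}\right) - \tfrac{i}{n}$. Since $F^+(x)$ is Lipschitz, we have that $\delta_t' \le L^+ \cdot \left(\tfrac{\epsilon}{n^{(1-\alpha)}}\right)$. Hence, we
obtain

$$P\left(u^+(X_{[i]}) - \xi^+_{\frac{i}{n}} > \tfrac{\epsilon}{n^{(1-\alpha)}}\right) < e^{-2n \cdot L^+ \frac{\epsilon}{n^{(1-\alpha)}}} = e^{-2n^{\alpha} \cdot L^+ \epsilon}. \qquad (25)$$

In a similar fashion, one can show that

$$P\left(u^+(X_{[i]}) - \xi^+_{\frac{i}{n}} < -\tfrac{\epsilon}{n^{(1-\alpha)}}\right) \le e^{-2n^{\alpha} \cdot L^+ \epsilon}. \qquad (26)$$

Combining (25) and (26), we obtain

$$P\left(\left|u^+(X_{[i]}) - \xi^+_{\frac{i}{n}}\right| > \tfrac{\epsilon}{n^{(1-\alpha)}}\right) \le 2 \cdot e^{-2n^{\alpha} \cdot L^+ \epsilon}, \quad \forall i \in \mathbb{N} \cap (0, n).$$

Plugging the above in (23), we obtain

$$P\left(\left|\sum_{i=1}^{n-1} u^+(X_{[i]}) \cdot \left(w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right) - \sum_{i=1}^{n-1} \xi^+_{\frac{i}{n}} \cdot \left(w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right)\right| > \epsilon\right) \le 2n \cdot e^{-2n^{\alpha} \cdot L^+ \epsilon}. \qquad (27)$$

Notice that $\sum_{n=1}^{+\infty} 2n \cdot e^{-2n^{\alpha} \cdot L^+ \epsilon} < \infty$ since the sequence $2n \cdot e^{-2n^{\alpha} \cdot L^+ \epsilon}$ will decrease more rapidly than
the sequence $\frac{1}{n^k}$, $\forall k > 1$.

By applying the Borel-Cantelli lemma, we have that $\forall \epsilon > 0$

$$P\left(\left|\sum_{i=1}^{n-1} u^+(X_{[i]}) \cdot \left(w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right) - \sum_{i=1}^{n-1} \xi^+_{\frac{i}{n}} \cdot \left(w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right)\right| > \epsilon, \text{ i.o.}\right) = 0,$$

which implies

$$\sum_{i=1}^{n-1} u^+(X_{[i]}) \cdot \left(w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right) - \sum_{i=1}^{n-1} \xi^+_{\frac{i}{n}} \cdot \left(w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right) \xrightarrow{n \to +\infty} 0 \text{ w.p. } 1,$$

which proves (20).

The proof of $C_n^- \to C^-(X)$ follows in a similar manner as above by replacing $u^+(X_{[i]})$ by $u^-(X_{[n-i]})$,
after observing that $u^-$ is decreasing, which in turn implies that $u^-(X_{[n-i]})$ is an estimate of the quantile
$\xi^-_{\frac{i}{n}}$. □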


Proof of Proposition 2 

For proving Proposition 2, we require the following well-known inequality that provides a finite-time bound
on the distance between the empirical distribution and the true distribution:

Lemma 3. (Dvoretzky-Kiefer-Wolfowitz (DKW) inequality)

Let $F_n(u) = \frac{1}{n} \sum_{i=1}^{n} 1_{(u(X_i) \le u)}$ denote the empirical distribution of a r.v. $U$, with $u(X_1), \ldots, u(X_n)$
being sampled from the r.v. $u(X)$. Then, for any $n$ and $\epsilon > 0$, we have

$$P\left(\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| > \epsilon\right) \le 2e^{-2n\epsilon^2}.$$


The reader is referred to Chapter 2 of Wasserman (2015) for more on empirical distributions in general 
and DKW inequality in particular. 

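Proof. We prove the $w^+$ part, and the $w^-$ part follows in a similar fashion. Since $u^+(X)$ is bounded above
by $M$ and $w^+$ is Hölder-continuous, we have

$$\left|\int_0^{\infty} w^+(P(u^+(X) > t))\,dt - \int_0^{\infty} w^+(1 - F_n^+(t))\,dt\right|$$
$$= \left|\int_0^{M} w^+(P(u^+(X) > t))\,dt - \int_0^{M} w^+(1 - F_n^+(t))\,dt\right|$$
$$\le \left|\int_0^{M} H \cdot \left|P(u^+(X) \le t) - F_n^+(t)\right|^{\alpha} dt\right|$$
$$\le HM \sup_{t \in \mathbb{R}} \left|P(u^+(X) \le t) - F_n^+(t)\right|^{\alpha}.$$

Now, plugging in the DKW inequality, we obtain

$$P\left(\left|\int_0^{+\infty} w^+(P(u^+(X) > t))\,dt - \int_0^{+\infty} w^+(1 - F_n^+(t))\,dt\right| > \epsilon\right)$$
$$\le P\left(HM \sup_{t \in \mathbb{R}} \left|P(u^+(X) \le t) - F_n^+(t)\right|^{\alpha} > \epsilon\right) \le e^{-n \frac{\epsilon^{(2/\alpha)}}{2H^2M^2}}. \qquad (28)$$

□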


C.2 Lipschitz continuous weights

Setting $\alpha = \gamma = 1$ in the proof of Proposition 3, it is easy to see that the CPT-value (14) is finite.

Next, in order to prove the asymptotic convergence claim in Proposition 3, we require the dominated
convergence theorem in its generalized form, which is provided below.

Theorem 4. (Generalized Dominated Convergence theorem) Let $\{f_n\}_{n=1}^{\infty}$ be a sequence of measurable
functions on $E$ that converge pointwise a.e. on a measurable space $E$ to $f$. Suppose there is a sequence
$\{g_n\}$ of integrable functions on $E$ that converge pointwise a.e. on $E$ to $g$ such that $|f_n| \le g_n$ for all $n \in \mathbb{N}$.
If $\lim_{n \to \infty} \int_E g_n = \int_E g$, then $\lim_{n \to \infty} \int_E f_n = \int_E f$.

Proof. This is a standard result that can be found in any textbook on measure theory. For instance, see 
Theorem 2.3.11 in Athreya and Lahiri (2006). □ 

Proof of Proposition 3: Asymptotic convergence 

Proof. Notice the following equivalence:

$$\sum_{i=1}^{n-1} u^+(X_{[i]}) \left(w^+\left(\tfrac{n-i}{n}\right) - w^+\left(\tfrac{n-i-1}{n}\right)\right) = \int_0^{\infty} w^+(1 - F_n^+(x))\,dx,$$

and also,

$$\sum_{i=1}^{n-1} u^-(X_{[i]}) \left(w^-\left(\tfrac{i}{n}\right) - w^-\left(\tfrac{i-1}{n}\right)\right) = \int_0^{\infty} w^-(1 - F_n^-(x))\,dx,$$

where $F_n^+(x)$ and $F_n^-(x)$ are the empirical distributions of $u^+(X)$ and $u^-(X)$, respectively.

Thus, the CPT estimator $\overline{C}_n$ in Algorithm 1 can be written equivalently as follows:

$$\overline{C}_n = \int_0^{+\infty} w^+(1 - F_n^+(x))\,dx - \int_0^{+\infty} w^-(1 - F_n^-(x))\,dx. \qquad (29)$$

We first prove the asymptotic convergence claim for the first integral in (29), i.e., we show

$$\int_0^{+\infty} w^+(1 - F_n^+(x))\,dx \to \int_0^{+\infty} w^+(P(u^+(X) > x))\,dx. \qquad (30)$$

Since $w^+$ is Lipschitz continuous with constant $L$, we have almost surely that $w^+(1 - F_n(x)) \le L(1 - F_n(x))$, for all $n$, and $w^+(P(u^+(X) > x)) \le L \cdot (P(u^+(X) > x))$, since $w^+(0) = 0$.

Notice that the empirical distribution function $F_n^+(x)$ generates a Stieltjes measure which takes mass
$1/n$ on each of the sample points $u^+(X_i)$.

We have

$$\int_0^{+\infty} P(u^+(X) > x)\,dx = E(u^+(X))$$

and

$$\int_0^{+\infty} (1 - F_n^+(x))\,dx = \int_0^{+\infty} \int_x^{\infty} dF_n(t)\,dx. \qquad (31)$$

Since $F_n^+(x)$ has bounded support on $\mathbb{R}$ $\forall n$, the integral in (31) is finite. Applying Fubini's theorem to the
RHS of (31), we obtain

$$\int_0^{+\infty} \int_x^{\infty} dF_n(t)\,dx = \int_0^{+\infty} \int_0^{t} dx\,dF_n(t) = \int_0^{+\infty} t\,dF_n(t) = \frac{1}{n} \sum_{i=1}^{n} u^+(X_{[i]}), \qquad (32)$$

where $u^+(X_{[i]})$, $i = 1,\ldots,n$ denote the order statistics, i.e., $u^+(X_{[1]}) \le \ldots \le u^+(X_{[n]})$.

Now, notice that

$$\frac{1}{n} \sum_{i=1}^{n} u^+(X_{[i]}) = \frac{1}{n} \sum_{i=1}^{n} u^+(X_i) \xrightarrow{a.s.} E(u^+(X)).$$

From the foregoing,

$$\lim_{n \to \infty} \int_0^{\infty} L \cdot (1 - F_n(x))\,dx \xrightarrow{a.s.} \int_0^{\infty} L \cdot (P(u^+(X) > x))\,dx.$$

Hence, we have

$$\int_0^{\infty} w^+(1 - F_n(x))\,dx \xrightarrow{a.s.} \int_0^{\infty} w^+(P(u^+(X) > x))\,dx.$$

The claim in (30) now follows by invoking the generalized dominated convergence theorem by setting
$f_n = w^+(1 - F_n^+(x))$ and $g_n = L \cdot (1 - F_n(x))$, and noticing that $L \cdot (1 - F_n(x)) \xrightarrow{a.s.} L(P(u^+(X) > x))$
uniformly $\forall x$. The latter fact is implied by the Glivenko-Cantelli theorem (cf. Chapter 2 of Wasserman
(2015)).

Following similar arguments, it is easy to show that

$$\int_0^{+\infty} w^-(1 - F_n^-(x))\,dx \xrightarrow{a.s.} \int_0^{+\infty} w^-(P(u^-(X) > x))\,dx.$$

The final claim regarding the almost sure convergence of $\overline{C}_n$ to $C(X)$ now follows.
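The integral form of the estimator lends itself to a short numerical sketch. The code below integrates the step function $w(1 - F_n(x))$ exactly using order statistics; it is an illustration under assumptions, not a transcription of Algorithm 1 (in particular, its indexing may differ by a shift from the sums written above), and it assumes utilities with $u^+(0) = u^-(0) = 0$ that accept NumPy arrays. All function names are hypothetical.

```python
import numpy as np

def distorted_integral(values, w):
    """Exact integral of w(1 - F_n(x)) over [0, inf) for non-negative values."""
    v = np.sort(np.asarray(values, dtype=float))
    n = v.size
    gaps = np.diff(np.concatenate(([0.0], v)))   # v_[i] - v_[i-1], with v_[0] = 0
    tail = (n - np.arange(n)) / n                # 1 - F_n on each gap: n/n, ..., 1/n
    return float(np.sum(gaps * np.array([w(t) for t in tail])))

def cpt_estimate(samples, u_plus, u_minus, w_plus, w_minus):
    """Difference of the distorted integrals for the gain and loss parts."""
    x = np.asarray(samples, dtype=float)
    gains = u_plus(np.maximum(x, 0.0))      # utility of the gain part
    losses = u_minus(np.maximum(-x, 0.0))   # utility of the loss magnitude
    return distorted_integral(gains, w_plus) - distorted_integral(losses, w_minus)

identity = lambda z: z
samples = [-1.0, 2.0, 3.0]
estimate = cpt_estimate(samples, identity, identity, identity, identity)
```

With identity utilities and weights the estimate collapses to the sample mean, which gives a simple sanity check of the construction.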

Proof of Proposition 3: Sample complexity 

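Proof. Since $u^+(X)$ is bounded above by $M$ and $w^+$ is Lipschitz with constant $L$, we have

$$\left|\int_0^{+\infty} w^+(P(u^+(X) > x))\,dx - \int_0^{+\infty} w^+(1 - F_n^+(x))\,dx\right|$$
$$= \left|\int_0^{M} w^+(P(u^+(X) > x))\,dx - \int_0^{M} w^+(1 - F_n^+(x))\,dx\right|$$
$$\le \left|\int_0^{M} L \cdot \left|P(u^+(X) < x) - F_n^+(x)\right| dx\right|$$
$$\le LM \sup_{x \in \mathbb{R}} \left|P(u^+(X) < x) - F_n^+(x)\right|.$$

Now, plugging in the DKW inequality, we obtain

$$P\left(\left|\int_0^{+\infty} w^+(P(u^+(X) > x))\,dx - \int_0^{+\infty} w^+(1 - F_n^+(x))\,dx\right| > \epsilon/2\right)$$
$$\le P\left(LM \sup_{x \in \mathbb{R}} \left|P(u^+(X) < x) - F_n^+(x)\right| > \epsilon/2\right) \le 2e^{-n \frac{\epsilon^2}{2L^2M^2}}. \qquad (33)$$

Along similar lines, we obtain

$$P\left(\left|\int_0^{+\infty} w^-(P(u^-(X) > x))\,dx - \int_0^{+\infty} w^-(1 - F_n^-(x))\,dx\right| > \epsilon/2\right) \le 2e^{-n \frac{\epsilon^2}{2L^2M^2}}. \qquad (34)$$

Combining (33) and (34), we obtain

$$P(|\overline{C}_n - C(X)| > \epsilon) \le P\left(\left|\int_0^{+\infty} w^+(P(u^+(X) > x))\,dx - \int_0^{+\infty} w^+(1 - F_n^+(x))\,dx\right| > \epsilon/2\right)$$
$$\qquad + P\left(\left|\int_0^{+\infty} w^-(P(u^-(X) > x))\,dx - \int_0^{+\infty} w^-(1 - F_n^-(x))\,dx\right| > \epsilon/2\right)$$
$$\le 4e^{-n \frac{\epsilon^2}{2L^2M^2}}.$$

And the claim follows. □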

C.3 Proofs for discrete valued X

Without loss of generality, assume $w^+ = w^- = w$, and let

$$F_k = \begin{cases} \sum_{i=1}^{k} p_i & \text{if } k \le l, \\ \sum_{i=k}^{K} p_i & \text{if } k > l. \end{cases} \qquad (35)$$

The following proposition gives the rate at which $\hat{F}_k$ converges to $F_k$.
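Proposition 7. Let $\hat{F}_k$ and $F_k$ be as defined in (4) and (35), respectively. Then, we have that, for every $\epsilon > 0$,

$$P(|\hat{F}_k - F_k| > \epsilon) \le 2e^{-2n\epsilon^2}.$$

Proof. We focus on the case when $k > l$, while the case of $k \le l$ is proved in a similar fashion. Notice that
when $k > l$, $\hat{F}_k = \frac{1}{n} \sum_{i=1}^{n} I_{(X_i \ge x_k)}$. Since the random variables $X_i$ are independent of each other and, for each $i$,
the indicators are bounded above by 1, we can apply Hoeffding's inequality to obtain

$$P(|\hat{F}_k - F_k| > \epsilon) = P\left(\left|\frac{1}{n} \sum_{i=1}^{n} I_{\{X_i \ge x_k\}} - \frac{1}{n} \sum_{i=1}^{n} E(I_{\{X_i \ge x_k\}})\right| > \epsilon\right)$$
$$= P\left(\left|\sum_{i=1}^{n} I_{\{X_i \ge x_k\}} - \sum_{i=1}^{n} E(I_{\{X_i \ge x_k\}})\right| > n\epsilon\right) \le 2e^{-2n\epsilon^2}.$$

□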

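The proof of Proposition 4 requires the following claim which gives the convergence rate under local
Lipschitz weights.

Proposition 8. Under conditions of Proposition 4, with $\hat{F}_k$ and $F_k$ as defined in (4) and (35), we have

$$P\left(\left|\sum_{k=1}^{K} u_k w(\hat{F}_k) - \sum_{k=1}^{K} u_k w(F_k)\right| > \epsilon\right) \le K \cdot \left(e^{-\delta^2 \cdot 2n} + e^{-\epsilon^2 \cdot 2n/(KLA)^2}\right), \text{ where}$$

$$u_k = \begin{cases} u^-(x_k) & \text{if } k \le l, \\ u^+(x_k) & \text{if } k > l. \end{cases} \qquad (36)$$

Proof. Observe that

$$P\left(\left|\sum_{k=1}^{K} u_k w(\hat{F}_k) - \sum_{k=1}^{K} u_k w(F_k)\right| > \epsilon\right) \le P\left(\bigcup_{k=1}^{K} \left\{\left|u_k w(\hat{F}_k) - u_k w(F_k)\right| > \tfrac{\epsilon}{K}\right\}\right)$$
$$\le \sum_{k=1}^{K} P\left(\left|u_k w(\hat{F}_k) - u_k w(F_k)\right| > \tfrac{\epsilon}{K}\right).$$

Notice that $\forall k = 1,\ldots,K$, on the interval $(p_k - \delta, p_k + \delta)$, the function $w$ is locally Lipschitz with common constant $L$.
Therefore, for each $k$, we can decompose the probability as

$$P\left(\left|u_k w(\hat{F}_k) - u_k w(F_k)\right| > \tfrac{\epsilon}{K}\right)$$
$$= P\left(\left[|\hat{F}_k - F_k| > \delta\right] \cap \left[\left|u_k w(\hat{F}_k) - u_k w(F_k)\right| > \tfrac{\epsilon}{K}\right]\right) + P\left(\left[|\hat{F}_k - F_k| \le \delta\right] \cap \left[\left|u_k w(\hat{F}_k) - u_k w(F_k)\right| > \tfrac{\epsilon}{K}\right]\right)$$
$$\le P\left(|\hat{F}_k - F_k| > \delta\right) + P\left(\left[|\hat{F}_k - F_k| \le \delta\right] \cap \left[\left|u_k w(\hat{F}_k) - u_k w(F_k)\right| > \tfrac{\epsilon}{K}\right]\right).$$

According to the property of local Lipschitz continuity, we have

$$P\left(\left[|\hat{F}_k - F_k| \le \delta\right] \cap \left[\left|u_k w(\hat{F}_k) - u_k w(F_k)\right| > \tfrac{\epsilon}{K}\right]\right) \le P\left(u_k L |\hat{F}_k - F_k| > \tfrac{\epsilon}{K}\right) \le e^{-\epsilon^2 \cdot 2n/(KLu_k)^2} \le e^{-\epsilon^2 \cdot 2n/(KLA)^2}, \quad \forall k.$$

And similarly,

$$P\left(|\hat{F}_k - F_k| > \delta\right) \le e^{-\delta^2 \cdot 2n}, \quad \forall k.$$

And as a result,

$$P\left(\left|\sum_{k=1}^{K} u_k w(\hat{F}_k) - \sum_{k=1}^{K} u_k w(F_k)\right| > \epsilon\right) \le \sum_{k=1}^{K} P\left(\left|u_k w(\hat{F}_k) - u_k w(F_k)\right| > \tfrac{\epsilon}{K}\right)$$
$$\le \sum_{k=1}^{K} \left(e^{-\delta^2 \cdot 2n} + e^{-\epsilon^2 \cdot 2n/(KLA)^2}\right) = K \cdot \left(e^{-\delta^2 \cdot 2n} + e^{-\epsilon^2 \cdot 2n/(KLA)^2}\right).$$

□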


Proof of Proposition 4 

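Proof. With $u_k$ as defined in (36), we need to prove that

$$P\left(\left|\sum_{k=1}^{K} u_k \cdot \left(w(\hat{F}_k) - w(\hat{F}_{k+1})\right) - \sum_{k=1}^{K} u_k \cdot \left(w(F_k) - w(F_{k+1})\right)\right| \le \epsilon\right) > 1 - \rho, \qquad (37)$$

for all $n$ above the threshold given in Proposition 4, where $w$ is locally Lipschitz continuous with constants $L_1,\ldots,L_K$ at the points $F_1,\ldots,F_K$. From a parallel
argument to that in the proof of Proposition 8, it is easy to infer that

$$P\left(\left|\sum_{k=1}^{K} u_k w(\hat{F}_{k+1}) - \sum_{k=1}^{K} u_k w(F_{k+1})\right| > \epsilon\right) \le K \cdot \left(e^{-\delta^2 \cdot 2n} + e^{-\epsilon^2 \cdot 2n/(KLA)^2}\right).$$

Hence,

$$P\left(\left|\sum_{k=1}^{K} u_k \cdot \left(w(\hat{F}_k) - w(\hat{F}_{k+1})\right) - \sum_{k=1}^{K} u_k \cdot \left(w(F_k) - w(F_{k+1})\right)\right| > \epsilon\right)$$
$$\le P\left(\left|\sum_{k=1}^{K} u_k \cdot w(\hat{F}_k) - \sum_{k=1}^{K} u_k \cdot w(F_k)\right| > \epsilon/2\right) + P\left(\left|\sum_{k=1}^{K} u_k \cdot w(\hat{F}_{k+1}) - \sum_{k=1}^{K} u_k \cdot w(F_{k+1})\right| > \epsilon/2\right)$$
$$\le 2K \left(e^{-\delta^2 \cdot 2n} + e^{-\epsilon^2 \cdot 2n/(KLA)^2}\right).$$

The claim in (37) now follows. □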


D Proofs for CPT-SPSA-G 

To prove the main result in Theorem 1, we first show, in the following lemma, that the gradient estimate
using SPSA is only an order $O(\delta_n^2)$ term away from the true gradient. The proof differs from the corresponding
claim for regular SPSA (see Lemma 1 in Spall (1992)) since we have a non-zero bias in the function
evaluations, while the regular SPSA assumes the noise is zero-mean. Following this lemma, we complete
the proof of Theorem 1 by invoking the well-known Kushner-Clark lemma (Kushner and Clark, 1978).
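Lemma 5. Let $\mathcal{F}_n = \sigma(\theta_m, m \le n)$, $n \ge 1$. Then, for any $i = 1,\ldots,d$, we have almost surely,

$$\left|E\left[\left.\frac{\overline{C}_n^{\theta_n + \delta_n\Delta_n} - \overline{C}_n^{\theta_n - \delta_n\Delta_n}}{2\delta_n\Delta_n^i}\,\right|\,\mathcal{F}_n\right] - \nabla_i C(X^{\theta_n})\right| \to 0 \text{ as } n \to \infty. \qquad (38)$$

Proof. Recall that the CPT-value estimation scheme is biased, i.e., providing samples with policy $\theta$, we
obtain its CPT-value estimate as $C(X^\theta) + \epsilon^\theta$. Here $\epsilon^\theta$ denotes the bias.

We claim

$$E\left[\left.\frac{\overline{C}_n^{\theta_n + \delta_n\Delta_n} - \overline{C}_n^{\theta_n - \delta_n\Delta_n}}{2\delta_n\Delta_n^i}\,\right|\,\mathcal{F}_n\right] = E\left[\left.\frac{C(X^{\theta_n + \delta_n\Delta_n}) - C(X^{\theta_n - \delta_n\Delta_n})}{2\delta_n\Delta_n^i}\,\right|\,\mathcal{F}_n\right] + E[\eta_n \mid \mathcal{F}_n], \qquad (39)$$

where $\eta_n = \dfrac{\epsilon^{\theta_n + \delta_n\Delta_n} - \epsilon^{\theta_n - \delta_n\Delta_n}}{2\delta_n\Delta_n^i}$ is the bias arising out of the empirical distribution based CPT-value
estimation scheme. From Proposition 2 and the fact that $\frac{1}{m_n^{\alpha/2}\delta_n} \to 0$ by assumption (A3), we have that $\eta_n$
goes to zero asymptotically. In other words,

$$E\left[\left.\frac{\overline{C}_n^{\theta_n + \delta_n\Delta_n} - \overline{C}_n^{\theta_n - \delta_n\Delta_n}}{2\delta_n\Delta_n^i}\,\right|\,\mathcal{F}_n\right] \xrightarrow{n \to \infty} E\left[\left.\frac{C(X^{\theta_n + \delta_n\Delta_n}) - C(X^{\theta_n - \delta_n\Delta_n})}{2\delta_n\Delta_n^i}\,\right|\,\mathcal{F}_n\right]. \qquad (40)$$

We now analyse the RHS of (40). By using suitable Taylor's expansions,

$$C(X^{\theta_n + \delta_n\Delta_n}) = C(X^{\theta_n}) + \delta_n\Delta_n^\top \nabla C(X^{\theta_n}) + \frac{\delta_n^2}{2}\Delta_n^\top \nabla^2 C(X^{\theta_n})\Delta_n + O(\delta_n^3),$$

$$C(X^{\theta_n - \delta_n\Delta_n}) = C(X^{\theta_n}) - \delta_n\Delta_n^\top \nabla C(X^{\theta_n}) + \frac{\delta_n^2}{2}\Delta_n^\top \nabla^2 C(X^{\theta_n})\Delta_n + O(\delta_n^3).$$

From the above, it is easy to see that

$$\frac{C(X^{\theta_n + \delta_n\Delta_n}) - C(X^{\theta_n - \delta_n\Delta_n})}{2\delta_n\Delta_n^i} - \nabla_i C(X^{\theta_n}) = \underbrace{\sum_{j=1, j \ne i}^{d} \frac{\Delta_n^j}{\Delta_n^i} \nabla_j C(X^{\theta_n})}_{(I)} + O(\delta_n^2).$$

Taking conditional expectation on both sides, we obtain

$$E\left[\left.\frac{C(X^{\theta_n + \delta_n\Delta_n}) - C(X^{\theta_n - \delta_n\Delta_n})}{2\delta_n\Delta_n^i}\,\right|\,\mathcal{F}_n\right] = \nabla_i C(X^{\theta_n}) + E\left[\sum_{j=1, j \ne i}^{d} \frac{\Delta_n^j}{\Delta_n^i}\right] \nabla_j C(X^{\theta_n}) + O(\delta_n^2)$$
$$= \nabla_i C(X^{\theta_n}) + O(\delta_n^2). \qquad (41)$$

The first equality above follows from the fact that $\Delta_n$ is distributed according to a $d$-dimensional vector of
Rademacher random variables and is independent of $\mathcal{F}_n$. The second equality follows by observing that
$\Delta_n^i$ is independent of $\Delta_n^j$, for any $i,j = 1,\ldots,d$, $j \ne i$.

The claim follows by using the fact that $\delta_n \to 0$ as $n \to \infty$. □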
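The two-sided SPSA gradient estimate analysed in this lemma can be sketched as follows. The quadratic `objective` is a hypothetical stand-in for the CPT-value estimate (which in the paper comes from simulation and carries an estimation bias), and the seed, perturbation size and number of repetitions are arbitrary choices for illustration.

```python
import numpy as np

def spsa_gradient(estimate, theta, delta, rng):
    """One SPSA gradient estimate using a Rademacher perturbation of all coordinates."""
    perturb = rng.choice([-1.0, 1.0], size=theta.shape)
    diff = estimate(theta + delta * perturb) - estimate(theta - delta * perturb)
    return diff / (2.0 * delta * perturb)   # i-th entry divides by delta * perturb[i]

rng = np.random.default_rng(0)

# Hypothetical smooth objective with known gradient -2 * (theta - 1).
objective = lambda th: -np.sum((th - 1.0) ** 2)
theta = np.array([0.5, 2.0, 3.0])

# Averaging many estimates suppresses the cross terms Delta^j / Delta^i.
averaged = np.mean(
    [spsa_gradient(objective, theta, 0.01, rng) for _ in range(4000)], axis=0
)
true_gradient = -2.0 * (theta - 1.0)
```

A single estimate is noisy because of the cross terms labelled (I) above; these have zero mean under the Rademacher perturbation, which is why the average approaches the true gradient.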
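Proof of Theorem 1

Proof. We first rewrite the update rule (10) as follows: For $i = 1,\ldots,d$,

$$\theta^i_{n+1} = \Gamma_i\left(\theta^i_n + \gamma_n\left(\nabla_i C(X^{\theta_n}) + \beta_n + \xi_n\right)\right), \qquad (42)$$

where

$$\beta_n = E\left[\left.\frac{\overline{C}_n^{\theta_n + \delta_n\Delta_n} - \overline{C}_n^{\theta_n - \delta_n\Delta_n}}{2\delta_n\Delta_n^i}\,\right|\,\mathcal{F}_n\right] - \nabla C(X^{\theta_n}), \text{ and}$$

$$\xi_n = \frac{\overline{C}_n^{\theta_n + \delta_n\Delta_n} - \overline{C}_n^{\theta_n - \delta_n\Delta_n}}{2\delta_n\Delta_n^i} - E\left[\left.\frac{\overline{C}_n^{\theta_n + \delta_n\Delta_n} - \overline{C}_n^{\theta_n - \delta_n\Delta_n}}{2\delta_n\Delta_n^i}\,\right|\,\mathcal{F}_n\right].$$

In the above, $\beta_n$ is the bias in the gradient estimate due to SPSA and $\xi_n$ is a martingale difference sequence.

Convergence of (42) can be inferred from Theorem 5.3.1 on pp. 191-196 of Kushner and Clark (1978),
provided we verify the necessary assumptions given as (B1)-(B5) below:

(B1) $\nabla C(X^\theta)$ is a continuous $\mathbb{R}^d$-valued function.

(B2) The sequence $\beta_n$, $n \ge 0$ is a bounded random sequence with $\beta_n \to 0$ almost surely as $n \to \infty$.

(B3) The step-sizes $\gamma_n$, $n \ge 0$ satisfy $\gamma_n \to 0$ as $n \to \infty$ and $\sum_n \gamma_n = \infty$.

(B4) $\{\xi_n, n \ge 0\}$ is a sequence such that for any $\epsilon > 0$,

$$\lim_{n \to \infty} P\left(\sup_{m \ge n} \left\|\sum_{k=n}^{m} \gamma_k \xi_k\right\| \ge \epsilon\right) = 0.$$

(B5) There exists a compact subset $\mathcal{K}$ which is the set of asymptotically stable equilibrium points for the
following ODE:

$$\dot{\theta}^i_t = \check{\Gamma}_i\left(-\nabla C(X^{\theta^i_t})\right), \text{ for } i = 1,\ldots,d. \qquad (43)$$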

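In the following, we verify the above assumptions for the recursion (10):

• (B1) holds by assumption in our setting.

• Lemma 5 above establishes that the bias $\beta_n$ is $O(\delta_n^2)$ and since $\delta_n \to 0$ as $n \to \infty$, it is easy to see
that (B2) is satisfied for $\beta_n$.

• (B3) holds by assumption (A3).

• We verify (B4) using arguments similar to those used in Spall (1992) for the classic SPSA algorithm:
We first recall Doob's martingale inequality (see (2.1.7) on pp. 27 of Kushner and Clark (1978)):

$$P\left(\sup_{l \ge 0} \|W_l\| \ge \epsilon\right) \le \frac{1}{\epsilon^2} \lim_{l \to \infty} E\|W_l\|^2. \qquad (44)$$

Applying the above inequality to the martingale sequence $\{W_l\}$, where $W_l := \sum_{n=0}^{l-1} \gamma_n \xi_n$, $l \ge 1$, we
obtain

$$P\left(\sup_{l \ge k} \left\|\sum_{n=k}^{l} \gamma_n \xi_n\right\| \ge \epsilon\right) \le \frac{1}{\epsilon^2} E\left\|\sum_{n=k}^{\infty} \gamma_n \xi_n\right\|^2 = \frac{1}{\epsilon^2} \sum_{n=k}^{\infty} \gamma_n^2 E\|\xi_n\|^2. \qquad (45)$$

The last equality above follows by observing that, for $m < n$, $E(\xi_m \xi_n) = E(\xi_m E(\xi_n \mid \mathcal{F}_n)) = 0$. We
now bound $E\|\xi_n\|^2$ as follows:

$$E\|\xi_n\|^2 \le E\left(\frac{\overline{C}_n^{\theta_n + \delta_n\Delta_n} - \overline{C}_n^{\theta_n - \delta_n\Delta_n}}{2\delta_n\Delta_n^i}\right)^2 \qquad (46)$$

$$\le \left(\left(E\left(\frac{\overline{C}_n^{\theta_n + \delta_n\Delta_n}}{2\delta_n\Delta_n^i}\right)^2\right)^{1/2} + \left(E\left(\frac{\overline{C}_n^{\theta_n - \delta_n\Delta_n}}{2\delta_n\Delta_n^i}\right)^2\right)^{1/2}\right)^2 \qquad (47)$$

$$\le \frac{1}{4\delta_n^2} \left[E\left(\frac{1}{(\Delta_n^i)^{2+2\alpha_1}}\right)\right]^{\frac{1}{1+\alpha_1}} \times \left(\left[E\left(\left[\overline{C}_n^{\theta_n + \delta_n\Delta_n}\right]^{2+2\alpha_2}\right)\right]^{\frac{1}{1+\alpha_2}} + \left[E\left(\left[\overline{C}_n^{\theta_n - \delta_n\Delta_n}\right]^{2+2\alpha_2}\right)\right]^{\frac{1}{1+\alpha_2}}\right) \qquad (48)$$

$$= \frac{1}{4\delta_n^2} \left(\left[E\left(\left[\overline{C}_n^{\theta_n + \delta_n\Delta_n}\right]^{2+2\alpha_2}\right)\right]^{\frac{1}{1+\alpha_2}} + \left[E\left(\left[\overline{C}_n^{\theta_n - \delta_n\Delta_n}\right]^{2+2\alpha_2}\right)\right]^{\frac{1}{1+\alpha_2}}\right) \qquad (49)$$

$$\le \frac{C}{\delta_n^2}, \text{ for some } C < \infty. \qquad (50)$$

The inequality in (46) uses the fact that, for any random variable $X$, $E\|X - E[X \mid \mathcal{F}_n]\|^2 \le EX^2$.
The inequality in (47) follows by the fact that $E(X+Y)^2 \le \left((EX^2)^{1/2} + (EY^2)^{1/2}\right)^2$. The inequality
in (48) uses Hölder's inequality, with $\alpha_1, \alpha_2 > 0$ satisfying $\frac{1}{1+\alpha_1} + \frac{1}{1+\alpha_2} = 1$. The equality in (49)
above follows owing to the fact that $E\left(\frac{1}{(\Delta_n^i)^{2+2\alpha_1}}\right) = 1$ as $\Delta_n^i$ is Rademacher. The inequality in
(50) follows by using the fact that $C(D^\theta)$ is bounded for any policy $\theta$ and the bias $\epsilon^\theta$ is bounded by
Proposition 2.

Thus, $E\|\xi_n\|^2 \le \frac{C}{\delta_n^2}$ for some $C < \infty$. Plugging this in (45), we obtain

$$\lim_{k \to \infty} P\left(\sup_{l \ge k} \left\|\sum_{n=k}^{l} \gamma_n \xi_n\right\| \ge \epsilon\right) \le \frac{dC}{\epsilon^2} \lim_{k \to \infty} \sum_{n=k}^{\infty} \frac{\gamma_n^2}{\delta_n^2} = 0.$$

The equality above follows from (A3) in the main paper.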


• Observe that $C(X^\theta)$ serves as a strict Lyapunov function for the ODE (43). This can be seen as
follows:

$$\frac{dC(X^\theta)}{dt} = \nabla C(X^\theta)\,\dot{\theta} = \nabla C(X^\theta)\,\check{\Gamma}\left(-\nabla C(X^\theta)\right) < 0.$$

Hence, the set $\mathcal{K} = \{\theta \mid \check{\Gamma}_i\left(-\nabla C(X^\theta)\right) = 0, \forall i = 1,\ldots,d\}$ serves as the asymptotically stable
attractor for the ODE (43).

The claim follows from the Kushner-Clark lemma. □


E Newton algorithm for CPT-value optimization (CPT-SPSA-N) 

E.l Need for second-order methods 

While stochastic gradient descent methods are useful in minimizing the CPT-value given biased estimates,
they are sensitive to the choice of the step-size sequence $\{\gamma_n\}$. In particular, for a step-size choice $\gamma_n = a_0/n$,
if $a_0$ is not chosen to be greater than $1/3\lambda_{\min}(\nabla^2 C(X^{\theta^*}))$, then the optimum rate of convergence
is not achieved, where $\lambda_{\min}$ denotes the minimum eigenvalue, while $\theta^* \in \mathcal{K}$ (see Theorem 1). A standard
approach to overcome this step-size dependency is to use iterate averaging, suggested independently by
Polyak (Polyak and Juditsky, 1992) and Ruppert (Ruppert, 1991). The idea is to use larger step-sizes $\gamma_n = 1/n^{\varsigma}$,
where $\varsigma \in (1/2, 1)$, and then combine it with averaging of the iterates. However, it is well known
that iterate averaging is optimal only in an asymptotic sense, while finite-time bounds show that the initial
condition is not forgotten sub-exponentially fast (see Theorem 2.2 in Fathi and Frikha (2013)). Thus, it is
optimal to average iterates only after a sufficient number of iterations have passed and all the iterates are
very close to the optimum. However, the latter situation serves as a stopping condition in practice.

An alternative approach is to employ step-sizes of the form $\gamma_n = (a_0/n) M_n$, where $M_n$ converges to
$(\nabla^2 C(X^{\theta^*}))^{-1}$, i.e., the inverse of the Hessian of the CPT-value at the optimum $\theta^*$. Such a scheme gets
rid of the step-size dependency (one can set $a_0 = 1$) and still obtains optimal convergence rates. This is the
motivation behind having a second-order optimization scheme.


E.2 Gradient and Hessian estimation 


We estimate the Hessian of the CPT-value function using the scheme suggested by Bhatnagar and Prashanth
(2015). As in the first-order method, we use Rademacher random variables to simultaneously perturb all
the coordinates. However, in this case, we require three system trajectories with corresponding parameters
$\theta_n + \delta_n(\Delta_n + \hat{\Delta}_n)$, $\theta_n - \delta_n(\Delta_n + \hat{\Delta}_n)$ and $\theta_n$, where $\{\Delta_n^i, \hat{\Delta}_n^i, i = 1,\ldots,d\}$ are i.i.d. Rademacher and
independent of $\theta_0, \ldots, \theta_n$. Using the CPT-value estimates for the aforementioned parameters, we estimate
the Hessian and the gradient of the CPT-value function as follows: For $i,j = 1,\ldots,d$, set

$$\hat{\nabla}_i C(X^{\theta_n}) = \frac{\overline{C}_n^{\theta_n + \delta_n(\Delta_n + \hat{\Delta}_n)} - \overline{C}_n^{\theta_n - \delta_n(\Delta_n + \hat{\Delta}_n)}}{2\delta_n\Delta_n^i},$$

$$\hat{H}_n^{i,j} = \frac{\overline{C}_n^{\theta_n + \delta_n(\Delta_n + \hat{\Delta}_n)} + \overline{C}_n^{\theta_n - \delta_n(\Delta_n + \hat{\Delta}_n)} - 2\overline{C}_n^{\theta_n}}{\delta_n^2 \Delta_n^i \hat{\Delta}_n^j}.$$

Notice that the above estimates require three samples, while the second-order SPSA algorithm proposed
first in Spall (2000) required four. Both the gradient estimate $\hat{\nabla} C(X^{\theta_n}) = [\hat{\nabla}_i C(X^{\theta_n})]$, $i = 1,\ldots,d$, and
the Hessian estimate $\hat{H}_n = [\hat{H}_n^{i,j}]$, $i,j = 1,\ldots,d$, can be shown to be an $O(\delta_n^2)$ term away from the true
gradient $\nabla C(X^{\theta_n})$ and Hessian $\nabla^2 C(X^{\theta_n})$, respectively (see Lemmas 7-8).


E.3 Update rule 

We update the parameter incrementally using a Newton decrement as follows: For i = 1...., d. 


oU i 


=r* 


^ + 7n^M^V j C(A^ 

3 =1 


H n —(1 £,n)H n —i + £ n H n 


(51) 

(52) 


where £ n is a step-size sequence that satisfies = °°> Yin < 00 an d ^ —>• 0 as n —> oo. These 

conditions on ^ n ensure that the updates to //,, proceed on a timescale that is faster than that of 9 n in (51) - 
see Chapter 6 of Borkar (2008). Further, F is a projection operator as in CPT-SPSA-G and M n = = 

T (Hn)- 1 . Notice that we invert H n in each iteration, and to ensure that this inversion is feasible (so that 
the 0-recursion descends), we project H n onto the set of positive definite matrices using the operator T. 
The operator has to be such that asymptotically T (H n ) should be the same as //„ (since the latter would 
Algorithm 3 Structure of CPT-SPSA-N algorithm.

Input: initial parameter $\theta_0 \in \Theta$ where $\Theta$ is a compact and convex subset of $\mathbb{R}^d$, perturbation constants
$\delta_n > 0$, sample sizes $\{m_n\}$, step-sizes $\{\gamma_n, \xi_n\}$, operator $\Gamma : \mathbb{R}^d \to \Theta$.

for n = 0, 1, 2, ... do

    Generate $\{\Delta_n^i, \widehat\Delta_n^i, i = 1,\ldots,d\}$ using Rademacher distribution, independent of $\{\Delta_m, \widehat\Delta_m, m = 0, 1, \ldots, n-1\}$.

    CPT-value Estimation (Trajectory 1)

        Simulate $m_n$ samples using parameter $(\theta_n + \delta_n(\Delta_n + \widehat\Delta_n))$.

        Obtain CPT-value estimate $\overline{\mathbb{C}}_n^{\theta_n + \delta_n(\Delta_n + \widehat\Delta_n)}$.

    CPT-value Estimation (Trajectory 2)

        Simulate $m_n$ samples using parameter $(\theta_n - \delta_n(\Delta_n + \widehat\Delta_n))$.

        Obtain CPT-value estimate $\overline{\mathbb{C}}_n^{\theta_n - \delta_n(\Delta_n + \widehat\Delta_n)}$.

    CPT-value Estimation (Trajectory 3)

        Simulate $m_n$ samples using parameter $\theta_n$.

        Obtain CPT-value estimate $\overline{\mathbb{C}}_n^{\theta_n}$ using Algorithm 1.

    Newton step

        Update the parameter and Hessian according to (51)-(52).

end for
Return $\theta_n$.


converge to the true Hessian), while ensuring inversion is feasible in the initial iterations. The assumption
below makes these requirements precise.

Assumption (A4). For any $\{A_n\}$ and $\{B_n\}$, $\lim_{n\to\infty} \|A_n - B_n\| = 0 \Rightarrow \lim_{n\to\infty} \|\Upsilon(A_n) - \Upsilon(B_n)\| = 0$.
Further, for any $\{C_n\}$ with $\sup_n \|C_n\| < \infty$, $\sup_n \big(\|\Upsilon(C_n)\| + \|\{\Upsilon(C_n)\}^{-1}\|\big) < \infty$.

A simple way to ensure the above is to have $\Upsilon(\cdot)$ as a diagonal matrix and then add a positive scalar $\delta_n$ to
the diagonal elements so as to ensure invertibility - see Gill et al. (1981), Spall (2000) for a similar operator.
Algorithm 3 presents the pseudocode.
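
A single Newton step (51)-(52) with a diagonal choice of $\Upsilon$ can be sketched as below. This is a hedged illustration under stated assumptions, not the authors' implementation: the particular form of `upsilon` (absolute value of the diagonal plus a positive scalar `eps`), the box projection standing in for $\Gamma$, and all function names are assumptions made here for illustration.

```python
def upsilon(H, eps):
    """Assumed operator: keep only the diagonal of H, take absolute values,
    and add eps > 0, so the result is diagonal and positive definite."""
    d = len(H)
    return [[abs(H[i][i]) + eps if i == j else 0.0 for j in range(d)]
            for i in range(d)]


def project_box(theta, lo, hi):
    """Assumed projection Gamma: componentwise clipping onto the box [lo, hi]^d."""
    return [min(max(t, lo), hi) for t in theta]


def newton_step(theta, grad_est, H_bar_prev, H_hat, gamma, xi, eps, lo, hi):
    """One iteration of (51)-(52) given the current gradient/Hessian estimates."""
    d = len(theta)

    # (52): running average of the Hessian estimates, driven by step-size xi.
    H_bar = [[(1.0 - xi) * H_bar_prev[i][j] + xi * H_hat[i][j]
              for j in range(d)] for i in range(d)]

    # M_n = Upsilon(H_bar)^{-1}; a diagonal matrix is inverted entrywise.
    U = upsilon(H_bar, eps)
    M_diag = [1.0 / U[i][i] for i in range(d)]

    # (51): projected update. M_n is diagonal here, so the sum over j
    # collapses to the single term j = i.
    theta_new = project_box(
        [theta[i] + gamma * M_diag[i] * grad_est[i] for i in range(d)], lo, hi)
    return theta_new, H_bar


if __name__ == "__main__":
    theta, H_bar = newton_step(
        theta=[0.0, 0.0], grad_est=[2.0, -4.0],
        H_bar_prev=[[0.0, 0.0], [0.0, 0.0]],
        H_hat=[[-3.0, 1.0], [1.0, -7.0]],
        gamma=0.5, xi=0.5, eps=0.5, lo=-1.0, hi=1.0)
    print(theta, H_bar)
```

The absolute value in `upsilon` is an added assumption: it keeps the diagonal positive even when the averaged Hessian has negative entries. Because this sketch discards off-diagonal entries, it only illustrates the mechanics of the update and is not claimed to satisfy the asymptotic requirement placed on $\Upsilon$.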


E.4 Convergence result 


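Theorem 6. Assume (A1)-(A4). Consider the ODE:

$$\dot\theta_t^i = \check\Gamma_i\Big(-\Upsilon\big(\nabla^2\mathbb{C}(X^{\theta_t})\big)^{-1}\nabla\mathbb{C}(X^{\theta_t^i})\Big), \text{ for } i = 1,\ldots,d,$$

where $\check\Gamma_i$ is as defined in Theorem 1. Let $\mathcal{K} = \{\theta \in \Theta \mid \nabla\mathbb{C}(X^{\theta^i})\,\check\Gamma_i\big(-\Upsilon(\nabla^2\mathbb{C}(X^{\theta}))^{-1}\nabla\mathbb{C}(X^{\theta^i})\big) = 0, \forall i = 1,\ldots,d\}$. Then, for $\theta_n$ governed by (51), we have

$$\theta_n \to \mathcal{K} \text{ a.s. as } n \to \infty.$$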

Proof. Before proving Theorem 6, we bound the bias in the SPSA-based estimate of the Hessian in the
following lemma.


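Lemma 7. For any $i,j = 1,\ldots,d$, we have almost surely,

$$\left| \mathbb{E}\left[ \left. \frac{\overline{\mathbb{C}}_n^{\theta_n + \delta_n(\Delta_n + \widehat\Delta_n)} + \overline{\mathbb{C}}_n^{\theta_n - \delta_n(\Delta_n + \widehat\Delta_n)} - 2\,\overline{\mathbb{C}}_n^{\theta_n}}{\delta_n^2 \Delta_n^i \widehat\Delta_n^j} \,\right|\, \mathcal{F}_n \right] - \nabla^2_{i,j}\mathbb{C}(X^{\theta_n}) \right| \to 0 \text{ as } n \to \infty. \tag{53}$$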
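Proof. As in the proof of Lemma 5, we can ignore the bias from the CPT-value estimation scheme and
conclude that

$$\mathbb{E}\left[ \left. \frac{\overline{\mathbb{C}}_n^{\theta_n + \delta_n(\Delta_n + \widehat\Delta_n)} + \overline{\mathbb{C}}_n^{\theta_n - \delta_n(\Delta_n + \widehat\Delta_n)} - 2\,\overline{\mathbb{C}}_n^{\theta_n}}{\delta_n^2 \Delta_n^i \widehat\Delta_n^j} \,\right|\, \mathcal{F}_n \right] \xrightarrow{n\to\infty} \mathbb{E}\left[ \left. \frac{\mathbb{C}(X^{\theta_n + \delta_n(\Delta_n + \widehat\Delta_n)}) + \mathbb{C}(X^{\theta_n - \delta_n(\Delta_n + \widehat\Delta_n)}) - 2\,\mathbb{C}(X^{\theta_n})}{\delta_n^2 \Delta_n^i \widehat\Delta_n^j} \,\right|\, \mathcal{F}_n \right]. \tag{54}$$

Now, the RHS of (54) approximates the true Hessian with only an $O(\delta_n^2)$ error; this can be inferred using
arguments similar to those used in the proof of Proposition 4.2 of Bhatnagar and Prashanth (2015). We
provide the proof here for the sake of completeness. Using Taylor’s expansion as in Lemma 5, we obtain

$$\frac{\mathbb{C}(X^{\theta_n + \delta_n(\Delta_n + \widehat\Delta_n)}) + \mathbb{C}(X^{\theta_n - \delta_n(\Delta_n + \widehat\Delta_n)}) - 2\,\mathbb{C}(X^{\theta_n})}{\delta_n^2 \Delta_n^i \widehat\Delta_n^j}$$

$$= \frac{(\Delta_n + \widehat\Delta_n)^T \nabla^2\mathbb{C}(X^{\theta_n})(\Delta_n + \widehat\Delta_n)}{\Delta_n^i \widehat\Delta_n^j} + O(\delta_n^2)$$

$$= \sum_{l=1}^{d}\sum_{m=1}^{d} \frac{\Delta_n^l \nabla^2_{l,m}\mathbb{C}(X^{\theta_n}) \Delta_n^m}{\Delta_n^i \widehat\Delta_n^j} + 2\sum_{l=1}^{d}\sum_{m=1}^{d} \frac{\Delta_n^l \nabla^2_{l,m}\mathbb{C}(X^{\theta_n}) \widehat\Delta_n^m}{\Delta_n^i \widehat\Delta_n^j} + \sum_{l=1}^{d}\sum_{m=1}^{d} \frac{\widehat\Delta_n^l \nabla^2_{l,m}\mathbb{C}(X^{\theta_n}) \widehat\Delta_n^m}{\Delta_n^i \widehat\Delta_n^j} + O(\delta_n^2).$$

Taking conditional expectation, we observe that the first and last terms above become zero, while the second
term becomes $\nabla^2_{i,j}\mathbb{C}(X^{\theta_n})$. The claim follows by using the fact that $\delta_n \to 0$ as $n \to \infty$. □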

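Lemma 8. For any $i = 1,\ldots,d$, we have almost surely,

$$\left| \mathbb{E}\left[ \left. \frac{\overline{\mathbb{C}}_n^{\theta_n + \delta_n(\Delta_n + \widehat\Delta_n)} - \overline{\mathbb{C}}_n^{\theta_n - \delta_n(\Delta_n + \widehat\Delta_n)}}{2\delta_n \Delta_n^i} \,\right|\, \mathcal{F}_n \right] - \nabla_i\mathbb{C}(X^{\theta_n}) \right| \to 0 \text{ as } n \to \infty. \tag{55}$$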


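Proof. As in the proof of Lemma 5, we can ignore the bias from the CPT-value estimation scheme and
conclude that

$$\mathbb{E}\left[ \left. \frac{\overline{\mathbb{C}}_n^{\theta_n + \delta_n(\Delta_n + \widehat\Delta_n)} - \overline{\mathbb{C}}_n^{\theta_n - \delta_n(\Delta_n + \widehat\Delta_n)}}{2\delta_n \Delta_n^i} \,\right|\, \mathcal{F}_n \right] \xrightarrow{n\to\infty} \mathbb{E}\left[ \left. \frac{\mathbb{C}(X^{\theta_n + \delta_n(\Delta_n + \widehat\Delta_n)}) - \mathbb{C}(X^{\theta_n - \delta_n(\Delta_n + \widehat\Delta_n)})}{2\delta_n \Delta_n^i} \,\right|\, \mathcal{F}_n \right].$$

The rest of the proof amounts to showing that the RHS of the above approximates the true gradient with an
$O(\delta_n^2)$ correcting term; this can be done in a similar manner as the proof of Lemma 5. □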
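
For the reader's convenience, the omitted Taylor step can be sketched as follows. The sketch assumes that $\mathbb{C}(X^{\theta})$ is three times continuously differentiable in $\theta$ with bounded third derivatives (an assumption stated here only for this sketch) and uses the shorthand $v_n = \Delta_n + \widehat\Delta_n$, which is introduced here and is not notation from the text.

```latex
\begin{align*}
\mathbb{C}(X^{\theta_n \pm \delta_n v_n})
  &= \mathbb{C}(X^{\theta_n})
     \pm \delta_n\, v_n^{T} \nabla\mathbb{C}(X^{\theta_n})
     + \frac{\delta_n^2}{2}\, v_n^{T} \nabla^2\mathbb{C}(X^{\theta_n})\, v_n
     + O(\delta_n^3), \\[4pt]
\frac{\mathbb{C}(X^{\theta_n + \delta_n v_n}) - \mathbb{C}(X^{\theta_n - \delta_n v_n})}
     {2\delta_n \Delta_n^i}
  &= \frac{v_n^{T} \nabla\mathbb{C}(X^{\theta_n})}{\Delta_n^i} + O(\delta_n^2) \\
  &= \nabla_i\mathbb{C}(X^{\theta_n})
     + \sum_{l \neq i} \frac{\Delta_n^l}{\Delta_n^i}\, \nabla_l\mathbb{C}(X^{\theta_n})
     + \sum_{l=1}^{d} \frac{\widehat\Delta_n^l}{\Delta_n^i}\, \nabla_l\mathbb{C}(X^{\theta_n})
     + O(\delta_n^2).
\end{align*}
```

The second-order terms cancel in the difference. Since the Rademacher perturbations are zero-mean, mutually independent and independent of $\theta_0, \ldots, \theta_n$, both sums vanish under the conditional expectation, leaving $\nabla_i\mathbb{C}(X^{\theta_n}) + O(\delta_n^2)$.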

Proof of Theorem 6 

Before we prove Theorem 6, we show that the Hessian recursion (52) converges to the true Hessian, for any
policy $\theta$.

Lemma 9. For any $i,j = 1,\ldots,d$, we have almost surely,

$$\left\| \overline H_n^{i,j} - \nabla^2_{i,j}\mathbb{C}(X^{\theta_n}) \right\| \to 0, \text{ and } \left\| \Upsilon(\overline H_n)^{-1} - \Upsilon\big(\nabla^2\mathbb{C}(X^{\theta_n})\big)^{-1} \right\| \to 0.$$

Proof. Follows in a similar manner as in the proofs of Lemmas 7.10 and 7.11 of Bhatnagar et al. (2013). □
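Proof. (Theorem 6) The proof follows in a similar manner as the proof of Theorem 7.1 in Bhatnagar et al.
(2013); we provide a sketch below for the sake of completeness.

We first rewrite the recursion (51) as follows: For $i = 1,\ldots,d$,

$$\theta_{n+1}^i = \Gamma_i\Big(\theta_n^i + \gamma_n \sum_{j=1}^{d} M^{i,j}(\theta_n)\,\nabla_j\mathbb{C}(X^{\theta_n}) + \gamma_n \zeta_n + \chi_{n+1} - \chi_n\Big), \tag{56}$$

where

$$M^{i,j}(\theta) = \Upsilon\big(\nabla^2\mathbb{C}(X^{\theta})\big)^{-1},$$

$$\chi_n = \sum_{m=0}^{n-1} \gamma_m \sum_{k=1}^{d} M^{i,k}(\theta_m) \left( \frac{\mathbb{C}(X^{\theta_m - \delta_m\Delta_m - \delta_m\widehat\Delta_m}) - \mathbb{C}(X^{\theta_m + \delta_m\Delta_m + \delta_m\widehat\Delta_m})}{2\delta_m \Delta_m^k} - \mathbb{E}\left[ \left. \frac{\mathbb{C}(X^{\theta_m - \delta_m\Delta_m - \delta_m\widehat\Delta_m}) - \mathbb{C}(X^{\theta_m + \delta_m\Delta_m + \delta_m\widehat\Delta_m})}{2\delta_m \Delta_m^k} \,\right|\, \mathcal{F}_m \right] \right),$$

and

$$\zeta_n = \mathbb{E}\left[ \left. \frac{\overline{\mathbb{C}}_n^{\theta_n + \delta_n(\Delta_n + \widehat\Delta_n)} - \overline{\mathbb{C}}_n^{\theta_n - \delta_n(\Delta_n + \widehat\Delta_n)}}{2\delta_n \Delta_n^i} \,\right|\, \mathcal{F}_n \right] - \nabla_i\mathbb{C}(X^{\theta_n}).$$

In view of Lemmas 7-9, it is easy to conclude that $\zeta_n \to 0$ as $n \to \infty$, $\chi_n$ is a martingale difference sequence
and that $\chi_{n+1} - \chi_n \to 0$ as $n \to \infty$. Thus, it is easy to see that (56) is a discretization of the ODE:

$$\dot\theta_t^i = \check\Gamma_i\Big(-\nabla\mathbb{C}(X^{\theta_t^i})\,\Upsilon\big(\nabla^2\mathbb{C}(X^{\theta_t})\big)^{-1}\nabla\mathbb{C}(X^{\theta_t^i})\Big). \tag{57}$$

Since $\mathbb{C}(X^{\theta})$ serves as a Lyapunov function for the ODE (57), it is easy to see that the set
$\mathcal{K} = \{\theta \mid \nabla\mathbb{C}(X^{\theta^i})\,\check\Gamma_i\big(-\Upsilon(\nabla^2\mathbb{C}(X^{\theta}))^{-1}\nabla\mathbb{C}(X^{\theta^i})\big) = 0, \forall i = 1,\ldots,d\}$ is an asymptotically stable
attractor set for the ODE (57). The claim now follows from the Kushner-Clark lemma. □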

□ 